ABSTRACT


NEURAL  NETWORK  DESIGN 

AND  THE 

COMPLEXITY  OF  LEARNING 


SEPTEMBER  1988 

J.  STEPHEN  JUDD,  B.SC.,  UNIVERSITY  OF  MANITOBA 
M.SC.,  UNIVERSITY  OF  MANITOBA 
PH.D.,  UNIVERSITY  OF  MASSACHUSETTS 


Directed  by:  Professor  Andrew  G.  Barto 


We  formalize  a  notion  of  learning  that  characterizes  the  training  of  feed-forward 
networks.  In  the  field  of  learning  theory,  it  stands  as  a  new  model  specialized  for 
the  type  of  learning  problems  that  arise  in  connectionist  networks.  The  formulation 
is similar to Valiant’s [Val84] in that we ask what can be feasibly learned from
examples  and  stored  in  a  particular  data  structure. 

One can view the data structure resulting from Valiant-type learning as a ‘sentence’
in a language described by grammatical syntax rules. Neither the words nor
their interrelationships are known a priori. Our learned data structure is more
particular than Valiant’s in that it must be a particular ‘sentence’. The position and
relationships of each ‘word’ are fully specified in advance, and the learning system
need only discover what the missing words are. This corresponds to the problem of
finding retrieval functions for each node in a given network.

We prove this problem NP-complete and thus demonstrate that learning in networks
has no efficient general solution (unless P = NP). Corollaries to the main theorem
demonstrate the NP-completeness of several sub-cases. While the intractability of the
problem precludes its solution in all these cases, we sketch some alternative definitions
of the problem in a search for tractable sub-cases.

One broad class of subcases is formed by placing constraints on the network
architecture; we study one type in particular. The focus of these constraints is on
families of ‘shallow’ architectures which are defined to have bounded depth and
unbounded width. We introduce a perspective on shallow networks, called the
Support Cone Interaction (SCI) graph, which is helpful in distinguishing tractable
from intractable subcases: when the SCI graph has tree-width $O(\log n)$, learning
can be accomplished in polynomial time; when its tree-width is $n^{\Omega(1)}$ we find the
problem NP-complete even if the SCI graph is a simple 2-dimensional planar grid.
Chapter 1

INTRODUCTION 


Drawing inspiration from neuroanatomy and spurred on by successes in modelling
cognitive phenomena, the connectionist model of computation has recently drawn
much attention (see for example the landmark volumes [RM86,MR86,AR88]).
Connectionist networks are also called neural networks. This model is used in the study
of how knowledge might be captured, represented, and processed by circuits that
are similar in an abstract sense to biological computers. A neural network is
characterized by its emphasis on using many richly interconnected processors that
perform relatively slow and simple calculations in parallel. The connectionist approach
shows promise of eventually providing a new language for designing and building
computational devices, and possibly may yield clues to the centuries-old puzzle of
brain function. Many aspects of connectionist networks, including structural
design, I/O protocol, and behavioural phenomena, have been compared to biological
brains.

The model is loosely defined around three aspects: computing units,
communication links, and message types. The computing units are small, homogeneous,
plentiful, simple and can accept many input connections. These units are connected
into networks by dedicated low-bandwidth links. These communication links are
also considered to be cheap and plentiful, an attitude that is a response to
neuroanatomical observations that individual neurons may have extremely high fan-in.

As yet, there seem to be few principles or methodologies for designing the
specific connectivity patterns in these networks. To the best of our knowledge, all
network designs in the literature have been rather ad hoc constructions for
specific experiments. In our view, this is a major inadequacy of the discipline. The
discovery of well-grounded and universal design principles would not only assist
the development of artificial neural networks but would also strengthen links to
neuroanatomy: hopefully neuroanatomists could confirm or repudiate the ideas by
examining biological brain structure.

Some  sources  of  design  constraints  arise  from  consideration  of 

•  learning  speed 

•  signal  integrity  (error  correction) 

•  processing  integrity  (fault  tolerance) 

•  retrieval  speed 

•  memory  capacity 

•  signalling  capacity  (bandwidth) 

•  3-dimensional  geometry 

•  power  supplies  and  heat  dissipation 

A thorough theoretical understanding of these areas would advance the field of
artificial neural networks.

We  propose  to  focus  on  the  first  item  in  this  list. 

1.1  Learning 

We think of learning as the capacity of a network to absorb information from its
environment without requiring some external intelligent agent to ‘program’ it. Learning
is a quintessential ability of brains, and it is a major focus of much connectionist
research. Unfortunately, the learning algorithms reported in the literature so far
are all unacceptably slow in large networks. Although it is clear that we need to
be able to scale up our applications to much bigger networks, it is not at all clear
how to achieve this. Many researchers view this as the most pressing challenge for
current connectionist research.

Figure 1.1: A Simple Model of Learning. Note the conceptual separation of
the system into two processes (shown in rectangles). This separation does
not correspond to any physical separation in a network. Typically, each
node serves as a repository for a piece of the memory and participates in
both processes that interact with memory.

The networks have two modes: the so-called ‘learning’ or loading mode, wherein
data are loaded into the permanent memory base, and the ‘retrieval’ mode, wherein
those associative data are recalled from memory. Figure 1.1 depicts the general
paradigm. During retrieval, each computing unit calculates an output value
by some simple rule such as a threshold function on the linear weighted sum of
its current inputs. The computation is performed repetitively at approximately
the same cycle rate as all other units. Hence the typical signal transmitted via a
connection is a single logical value or perhaps a scalar value. However, the network
as a whole is expected to do such things as associate pairs of bit patterns or find
completions of partial patterns.

These  comments  apply  to  all  types  of  connectionist  models  and  they  also  seem 
to  describe  standard  circuit  models  of  computation.  However,  connectionist  devices 
are  often  elaborated  with  various  features  like  bi-directional  connections,  learning 
capability,  stochasticity,  linear  sum  functions,  or  cyclic  dynamics.  For  simplicity, 
this  thesis  discusses  only  those  networks  that  retrieve  data  in  the  manner  of  a  strictly 
unidirectional  feed-forward  deterministic  circuit.  We  assume  that  the  networks 
have  some  means  of  changing  their  behaviour  but  that  this  change  does  not  involve 
altering  their  connectivity  structure. 

The  implicit  goal  of  connectionist  learning  research  has  been  to  find  a  single 
‘learning  rule’  that  each  network  unit  can  follow  in  order  to  adjust  the  weights 
used  in  its  linear  threshold  functions  in  such  a  way  that  the  retrieval  behaviour  of 
the  whole  network  is  eventually  correct.  It  was  hoped  that  a  learning  rule  would 
work  for  any  network  design.  Many  researchers  have  developed  candidates  for  such 
a learning algorithm; some notable approaches are the Perceptron [Ros61,MP72],
back-propagation [RHW86,Par85,lC85], Boltzmann [AHS85,HS86], and associative
reward-penalty ($A_{R-P}$) [BA85,Bar85] schemes.

There is a theorem proving the effectiveness of the Perceptron Learning Rule
for linearly separable tasks in a single layer of trainable nodes. In their book,
Minsky and Papert studied this learning rule and also investigated several computing
properties of 1- and 2-layer networks. But one of the tantalizing gaps that Minsky
and Papert left regards the learning problem in multi-layered networks. They
considered it an important research problem to extend results on learning algorithms
for single-layer nets to the case of multi-layer nets:

“Perhaps some powerful convergence theorem will be discovered, or some
profound reason for the failure to produce an interesting learning
theorem for the multi-layered machine will be found.”

Descriptions of the back-propagation, Boltzmann, and $A_{R-P}$ methods have each
been published along with demonstrations of their ability on selected associative
learning problems, and their required learning time has been studied empirically
(see Chapter 3). However, no proof of their effectiveness has been offered and no
analytical treatment of their scale-up properties has appeared. The published
successes in connectionist learning have been empirical results for very small networks,
typically far fewer than 100 nodes. To fully exploit the expressive power of networks
they need to be scaled up to much bigger sizes, but it is widely acknowledged that
as the networks get larger and deeper the amount of time required for them to load
the training data grows prohibitively [HV86,TJ88,Bar82,Omo87]. It is important
to find out how to avoid this phenomenon.

The connectionist learning problem is treated here first of all as simple
memorization of some given data by a given feed-forward network. This problem is
described and discussed in Chapter 2. We ask if there exists an efficient algorithm
for solving this learning problem. ‘Efficient’ is taken to mean that the worst-case
learning time for a network of size n should be bounded above by a polynomial
in n, something which can easily be proved by exhibiting an algorithm for it. An
excellent theoretical test is available which indicates intractability in a problem,
and that is to prove the problem to be NP-complete. We will explain and use this
tool in Chapter 4.

Are  there  efficient  algorithms  for  learning  in  large  connectionist  networks?  Or  is 
there  some  deep  reason  why  there  cannot  be?  Does  network  design  affect  learning 
ability?  How  does  learning  time  scale  up  with  network  size?  Can  scale-up  properties 
be manipulated through design techniques? This thesis addresses such questions.

1.2  Approach 

We  seek  design  principles  by  appealing  to  constraints  of  learnability.  As  is  often  the 
case  in  theoretical  pursuits,  it  is  easiest  to  investigate  extreme  cases  first,  in  order 
to  find  the  boundary  conditions  where  the  problem  is  certifiably  easy  or  certifiably 
infeasible,  and  later  to  refine  the  middle  ground. 

This thesis begins by identifying and formalizing a model of the computational
problem involved in getting a network to memorize data. The particular
formulation we use is closely related to the types of experiments being reported in the
connectionist literature. The thesis then uses the model to make two important points:

1. The learning problem in its general form is too difficult to solve. By proving
it to be NP-complete, we can claim that large instances of the problem would
be wildly impractical to solve (see Chapter 4). There is no reliable method
to configure a given arbitrary network to remember a given arbitrary body of
data in a reasonable amount of time.

This  result  shows  that  the  simple  problem  of  remembering  a  list  of  data  items 
(something  that  is  trivial  in  a  classical  random-access  machine)  is  extremely  difficult 
to  perform  in  some  fixed  networks. 

Of  course,  Connectionists  would  not  be  satisfied  if  all  they  got  out  of  their 
systems  was  rote  memory.  Much  of  the  fascination  of  Neural  Networks  comes  from 
the  possibility  of  their  having  generalization  properties  which  could  be  employed  to 
extend  data,  smooth  over  the  domain,  and  induce  the  structure  of  the  underlying 
data.  Only  so  would  they  achieve  compact  representations,  fast  calculations,  strong 
prediction,  and  intelligent  learning.  We  will  argue  that  success  in  generalizing 
presupposes  the  ability  to  memorize  simple  associative  data  faithfully  and  efficiently. 
The intractability of memorization suggests that the connectionist model, even
though  it  has  demonstrated  many  attractive  qualities,  may  have  a  crucial  flaw.  This 
might  well  be  a  disturbing  theorem  were  it  not  for  other  insights  that  accompany 
it: 

2. There are many ways to circumvent this negative result, and each one
corresponds to a particular constraint on the learning problem. There are fast
learning  algorithms  for  cases  where  the  network  is  of  a  very  restricted  design, 
or  where  the  data  to  be  loaded  are  very  simple. 

These two observations (the full problem is too hard; some sub-cases are easy)
provide a foundation for theoretical inquiries into the design of connectionist networks.
There are various ways to constrain the loading problem to find subcases that are
solvable in polynomial time: by restricting the task to be learned, by restricting the
architecture of the net, by relaxing the criterion of success, etc., or by combinations
of these. The very general hard cases and the very restricted easy cases establish
extrema within which a more complete theory can be constructed. This thesis
promotes the usefulness of elaborating such a theory, and will consider a few special
cases within the great variety of imaginable subcases.

1.3  Subcases 

In  Chapter  5  we  discuss  various  ways  of  formulating  sub-cases  or  simply  different 
cases  that  might  be  feasibly  solvable.  Even  in  several  of  these  restricted  sub-cases 
the  intractability  remains,  thus  revealing  a  labyrinth  of  open  and  closed  avenues  for 
discovering  what  it  is  that  large  connectionist  networks  can  or  cannot  learn.  What 
follows  here  is  a  description  of  the  major  theaters  in  which  constraining  conditions 
can  be  posed. 
1.3.1 Data to be Learned

By  putting  strong  constraints  on  what  the  network  is  required  to  learn,  some  trivial 
(and  uninteresting)  learning  problems  arise.  It  is  our  desire  to  know  if  there  are 
some  interesting  classes  of  learnable  tasks.  We  prove  it  intractable  for  networks  to 
learn  even  very  small  numbers  of  associated  pairs,  or  to  learn  sets  of  pairs  that  are 
drawn  from  a  monotonic  function. 

1.3.2  Network  Design 

By  putting  strong  constraints  on  the  type  of  network  used  for  learning,  some  trivial 
learning  problems  arise.  These  networks  may  all  be  next  to  useless,  but  our  main 
theorem  shows  that  if  we  allow  the  network  to  be  of  arbitrary  design  then  the 
learning  problem  is  too  hard.  The  challenge  is  to  see  if  there  are  any  intermediate 
network  designs  that  are  useful  and  can  learn  easily.  Unknown  cases  include  very 
deep  nets  and  highly  connected  nets.  For  various  reasons,  we  pursue  studying  the 
loading  problem  in  one  particular  broad  architectural  family  which  we  call  shallow 
architectures.  This  family  has  a  technical  definition  that  effectively  limits  the  depth 
of  each  network  but  does  not  limit  the  width.  This  family  is  interesting  because  it 
allows  us  to  study  the  load-time  scale-up  issue  without  having  to  deal  with  issues 
that  arise  in  deep  networks.  The  connectionist  literature  uniformly  reports  great 
difficulties  in  loading  deep  nets  so  we  have  taken  the  strategic  decision  to  avoid  the 
issue altogether and concentrate on shallow nets. NP-completeness appears even in
networks of depth 2, so there is still a considerable domain of issues to explore even
in  the  shallow  case.  Furthermore,  the  shallow  architectures  are  interesting  because 
they  might  be  a  useful  model  of  some  brain  structures. 

For the discussion of shallow architectures, we introduce the notion of a
support cone, which is the set of all nodes that can affect the behaviour of an output
node. Then we define a Support Cone Interaction (SCI) graph, which captures
how the support cones overlap with each other. When this SCI graph is a planar,
2-dimensional grid the loading problem is still NP-complete, but if the SCI graph
has limited tree-width then the architecture can be loaded in polynomial time.
Tree-width is a metric on graphs that is a generalization of the more widely known
graph-theoretic notion of bandwidth.
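As a rough illustration of these two notions, the sketch below computes support cones by walking backwards from each output node, and then links two output nodes whenever their cones share a node. The precise definition of the SCI graph is given later in the thesis; the "shared node" rule used here, and the names `support_cone`, `sci_edges`, and `pre`, are assumptions made only for this example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

def support_cone(output: str, pre: Dict[str, Tuple[str, ...]]) -> Set[str]:
    """Collect every node that can affect `output` (including `output` itself).

    `pre` maps each node to the posts feeding it; input posts have no entry.
    """
    cone: Set[str] = set()
    stack = [output]
    while stack:
        post = stack.pop()
        if post in pre and post not in cone:
            cone.add(post)
            stack.extend(pre[post])
    return cone

def sci_edges(outputs: List[str], pre: Dict[str, Tuple[str, ...]]) -> Set[FrozenSet[str]]:
    """Assumed overlap rule: join two outputs when their cones intersect."""
    cones = {o: support_cone(o, pre) for o in outputs}
    return {
        frozenset((a, b))
        for i, a in enumerate(outputs)
        for b in outputs[i + 1:]
        if cones[a] & cones[b]
    }

# A hypothetical depth-2 network with three output nodes.
pre = {
    "h1": ("x1", "x2"), "h2": ("x2", "x3"), "h3": ("x3", "x4"),
    "y1": ("h1", "h2"), "y2": ("h2", "h3"), "y3": ("h3",),
}
outputs = ["y1", "y2", "y3"]
edges = sci_edges(outputs, pre)
```

In this toy network the overlap graph is a path (y1 to y2 to y3), the kind of narrow structure for which the thesis reports polynomial-time loading.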

1.3.3  Node  Functionality 

A third way of constraining the learning model is to imbue the network nodes with
different amounts of functionality. The standard node type used in connectionist
research is the linear sum type, capable of performing any linearly separable binary
function. The NP-completeness found in our main theorem applies to this case, but
we go further to prove that even when the nodes are capable of performing much
more complex functions (e.g. arbitrary Boolean functions), or when the nodes are
capable of performing only extremely simple functions, the computational problem
is much the same. We had hoped that the theory might guide us in selecting
appropriate types of nodes (e.g. by somehow demonstrating that the linearly separable
functions are a logical or optimal choice). But the results are quite equivocal on this
matter. Subsequent work by Blum and Rivest [BR88] suggests that the linear sum
functions actually introduce special computational problems that could be avoided
with simpler functions or with more complex functions.

Our  complexity  results  are  almost  entirely  independent  of  the  type  of  node 
functions  used  in  the  networks.  This  is  a  strength  in  itself.  But  it  offers  a  further 
conclusion:  that  the  whole  issue  of  node  functionality  is  of  secondary  importance 
to  learning  complexity,  even  though  significant  research  effort  is  now  being  spent 
on  analyzing  the  particularities  of  one  or  two  particular  favourite  types. 
1.4 Philosophical Base


This study is based on the belief that the scale-up aspect of the learning issue is a
rich source of imperatives for network design and that the development of a theory
of learning is therefore well warranted. We posit that a thorough delineation of the
polynomial-time solvable cases from the NP-complete cases will illuminate design
constraints that all networks must adhere to in order to be capable of learning.
Specifically, we posit that an understanding of the roots of NP-completeness in
connectionist learning will yield techniques for building architectures that are easy
to load.

We  think  that  the  general  computational  question  framed  here  is  a  basis  that 
could  ultimately  lead  to  a  collection  of  definitions  of  tasks,  architectural  designs, 
and  loading  criteria,  and  to  a  theory  of  how  these  various  aspects  interact  to  create 
feasible  or  infeasible  learning.  This  thesis  is  a  beginning  toward  such  a  goal. 


1.5  Outline 

The  next  chapter  formulates  the  general  learning  problem  and  sets  up  the  formal 
question  regarding  its  complexity. 

Chapter  3  reviews  some  of  the  current  theory  on  learning  in  general,  and  relates 
our  model  of  learning  to  other  models  outside  of  the  connectionist  paradigm  that 
also  deal  with  learning  from  examples.  It  also  examines  what  is  currently  known 
about  scale-up  issues  in  connectionist  learning. 

Chapter 4 proves our main theorem, which deals with the intractability of the
general case. Section 4.2 reports that the complexity is invariant for almost any kind
of node functionality. The following chapter (5) elaborates some of the implications
of the main theorem and points out several corollaries applying to various special
subcases.

Chapter 6 proves the tractability and intractability of various families of shallow
networks.

Chapter  7  responds  to  some  concerns  about  how  well  neural  networks  would  be 
able  to  generalize  from  what  they  have  learned  to  other  parts  of  their  domains  that 
they  have  not  had  access  to.  Many  people  have  high  hopes  for  the  abilities  of  neural 
networks  to  perform  such  generalization.  Using  Valiant’s  technical  definition  of 
induction,  we  show  how  the  intractability  of  memorization  implies  the  intractability 
of  generalization  as  well. 

Finally,  Chapter  8  summarizes  our  results  and  discusses  some  implications  and 
extensions  to  the  work. 
Chapter 2


OUR  MODEL  OF  NETWORK 

LEARNING 


2.1  The  Learning  Protocol 

The type of learning investigated here is known as supervised learning. In this
paradigm  input  patterns  (called  stimuli)  are  presented  to  a  machine  paired  with 
their  desired  output  patterns  (called  responses).  The  object  of  the  learning  machine 
is  to  remember  all  the  associations  presented  during  a  training  phase  so  that  in 
future  tests  the  machine  will  be  able  to  emit  the  associated  response  for  any  given 
stimulus.  This  interaction  is  diagrammed  in  Figure  2.1. 

The  exact  form  of  presentation  of  these  data  is  not  of  concern  here.  Many 
connectionist  experiments  involve  a  long  series  of  training  samples  wherein  a  single 
associative  pair  is  presented  to  the  network  at  a  time,  and  where  any  particular  pair 
may  have  to  be  presented  many  times  over.  But  none  of  these  details  are  relevant 
here,  and  our  results  are  strengthened  by  abstracting  away  from  them.  We  require 
only  that  the  associative  data  are  available  in  some  reasonable  encoding. 

Figure 2.1: A Model of Supervised Learning. The loading process examines
the SR items and alters memory to store that data. Later, the retrieval
process accepts a stimulus and examines memory to find and emit the
associated response.

In what follows, every stimulus $\sigma$ is a fixed-length string of $s$ bits, and every
response $\rho$ is a string of $r$ bits with ‘don’t cares’, that is, $\sigma \in \{0,1\}^s$ and
$\rho \in \{0,1,*\}^r$. The output from a net is an element of $\{0,1\}^r$. The purpose of a response
string is to specify constraints on what a particular output can be: we say that
an output string, $\theta$, agrees with a response string, $\rho$, if each bit, $\theta_i$, of the output
equals the corresponding bit, $\rho_i$, of the response whenever $\rho_i \in \{0,1\}$. The notation
for such agreement is $\theta \models \rho$. Each stimulus/response pair, $(\sigma, \rho)$, is called an SR
item. A task is a set of SR items that the machine is required to learn. To be
reasonable, each distinct stimulus in a task should be associated with no more than
one distinct response. Equivalently, a task $T$ should be extendible to some function
$f : \{0,1\}^s \to \{0,1\}^r$. We view functions as sets of ordered pairs and use the notation
$T \subseteq f$ to mean $T \subseteq \{(\sigma, \rho) : f(\sigma) \models \rho\}$.
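The agreement relation and the ‘reasonableness’ condition on tasks can be made concrete with a small sketch. Strings over `0`, `1`, and `*` stand for responses; the function names `agrees` and `is_reasonable_task` are hypothetical, and the second function implements the ‘extendible to some function’ reading of the condition, which is an interpretation rather than the author's own procedure.

```python
from typing import Dict, Iterable, Tuple

SRItem = Tuple[str, str]  # (stimulus over "01", response over "01*")

def agrees(output: str, response: str) -> bool:
    """True if `output` equals `response` on every bit that is not a don't-care."""
    if len(output) != len(response):
        return False
    return all(r == "*" or o == r for o, r in zip(output, response))

def is_reasonable_task(task: Iterable[SRItem]) -> bool:
    """True if a single function could satisfy every SR item in the task.

    Items that share a stimulus are merged bit by bit; a clash between a
    required 0 and a required 1 means no function can extend the task.
    """
    merged: Dict[str, str] = {}
    for stimulus, response in task:
        previous = merged.get(stimulus, "*" * len(response))
        combined = []
        for a, b in zip(previous, response):
            if a == "*":
                combined.append(b)
            elif b == "*" or a == b:
                combined.append(a)
            else:
                return False
        merged[stimulus] = "".join(combined)
    return True

ok_task = [("00", "1*"), ("00", "*0"), ("11", "01")]
bad_task = [("00", "1"), ("00", "0")]
```

Here `ok_task` is acceptable because the two constraints on stimulus `00` can both be met by the output `10`, whereas `bad_task` demands two conflicting outputs for one stimulus.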

2.2  Network  Architecture 

The particular style of connectionist machines considered here is that of
non-recurrent, or feed-forward, networks of computing elements. This is a generalized
combinational circuit; the connections between nodes form a directed acyclic graph,
and the nodes perform some function of their inputs as calculated by previous nodes
in the graph.

We define an architecture as a 5-tuple $A = (P, V, S, R, E)$ where
$P$ is a set of posts,
$V$ is a set of $n$ nodes: $V = \{v_1, v_2, \ldots, v_n\} \subseteq P$,
$S$ is a set of $s$ input posts: $S = P - V$,
$R$ is a set of $r$ output posts: $R \subseteq P$, and
$E$ is a set of directed edges: $E \subseteq \{(v_i, v_j) : v_i \in P, v_j \in V, i < j\}$.

The constraints on the edges ensure that no cycles occur in the graph. Denote the
set of input posts to node $v_k$ as $\mathrm{pre}(v_k) = \{v_j : (v_j, v_k) \in E\}$. The size of this set
(denoted $|\mathrm{pre}(v_k)|$) is called the fan-in.

An  architecture  specifies  everything  about  a  circuit  except  what  kind  of  functions 
the  nodes  perform  (i.e.  what  kind  of  gates  they  are). 
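The 5-tuple can be written down directly as a small data structure. The sketch below is one possible encoding, not the author's: posts are named by strings, their position in the `posts` tuple plays the role of the index used in the edge constraint, and the class and method names are invented for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class Architecture:
    posts: Tuple[str, ...]              # P, listed in index order
    nodes: FrozenSet[str]               # V, a subset of P
    outputs: FrozenSet[str]             # R, a subset of P
    edges: FrozenSet[Tuple[str, str]]   # E, directed edges (from, to)

    @property
    def inputs(self) -> Tuple[str, ...]:
        """S = P - V: posts that are not computing nodes."""
        return tuple(p for p in self.posts if p not in self.nodes)

    def pre(self, node: str) -> Tuple[str, ...]:
        """Posts with an edge into `node`, in index order."""
        return tuple(p for p in self.posts if (p, node) in self.edges)

    def fan_in(self, node: str) -> int:
        return len(self.pre(node))

    def is_valid(self) -> bool:
        """Check V, R inside P and that every edge runs forward into a node."""
        index = {p: i for i, p in enumerate(self.posts)}
        return (
            self.nodes <= set(self.posts)
            and self.outputs <= set(self.posts)
            and all(
                u in index and v in self.nodes and index[u] < index[v]
                for u, v in self.edges
            )
        )

arch = Architecture(
    posts=("x1", "x2", "v1", "v2"),
    nodes=frozenset({"v1", "v2"}),
    outputs=frozenset({"v2"}),
    edges=frozenset({("x1", "v1"), ("x2", "v1"), ("x1", "v2"), ("v1", "v2")}),
)
```

Requiring each edge to run from a lower-indexed post to a higher-indexed node is what rules out cycles, so `is_valid` rejects any architecture containing a backward edge.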
2.3 Node Functions


Each node in a network contributes to the overall retrieval computation by taking
signals from its input edges and computing an output signal. Although some of
our results apply to real-valued outputs, this paper considers only binary-valued
functions:

$f_i : \{0,1\}^{|\mathrm{pre}(v_i)|} \to \{0,1\}$

The function $f_i$ is a member of a given set, $\mathcal{F}$, of functions, called a node function set.
Typically, connectionists have used the set of linearly separable functions (LSFns)
for $\mathcal{F}$. These functions are characterized by a real-valued threshold and a real-valued
weight associated with each input to a node. An activation level is calculated from
the weighted sum of the $a$ inputs, and the output is one of two values depending
on whether the activation is above or below the threshold.

$\mathrm{LSFns} = \{ f : \{0,1\}^a \to \{0,1\} \mid \exists\, W_i, \theta \in \mathbb{R} : \sum_{i=1}^{a} W_i X_i > \theta \iff f(X) = 1 \}$

We consider LSFns as well as a variety of other node function sets. Two variants
that are considered are called AOFns and LUFns. AOFns is the 2-element set
{AND, OR} (AOFns is from And-Or Functions). LUFns is the set of all Boolean
functions (LU is from Look-Up table). Note the inclusion hierarchy LUFns $\supset$ LSFns
$\supset$ AOFns for any given fan-in.
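The inclusion hierarchy can be checked by hand for fan-in 2. The sketch below builds a threshold unit from weights and a threshold, shows that AND and OR are both obtained this way, and enumerates the truth tables reachable from a small grid of weights and thresholds. The grid values and the helper names (`lsf`, `truth_table`, `small_lsf_tables`) are choices made for this example; the search is illustrative and covers only the grid, although for two inputs that grid happens to reach every linearly separable function.

```python
from itertools import product

def lsf(weights, threshold):
    """Return a threshold unit: output 1 when the weighted sum exceeds `threshold`."""
    def f(x):
        return int(sum(w * xi for w, xi in zip(weights, x)) > threshold)
    return f

def truth_table(f, fan_in):
    """Outputs of `f` over inputs ordered 00..0, 00..1, ..., 11..1."""
    return tuple(f(x) for x in product((0, 1), repeat=fan_in))

AND2 = lsf((1, 1), 1.5)   # fires only when both inputs are 1
OR2 = lsf((1, 1), 0.5)    # fires when at least one input is 1

def xor2(x):
    """A Boolean (look-up table) function that is not linearly separable."""
    return x[0] ^ x[1]

def small_lsf_tables(fan_in=2):
    """Truth tables realized by weights in {-1, 0, 1} and half-integer thresholds."""
    thresholds = (-1.5, -0.5, 0.5, 1.5)
    return {
        truth_table(lsf(w, t), fan_in)
        for w in product((-1, 0, 1), repeat=fan_in)
        for t in thresholds
    }

tables = small_lsf_tables(2)
```

Of the 16 Boolean functions of two inputs (the LUFns), the grid realizes 14; the two missing tables are XOR and its complement.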

We also consider two sets of node functions that have real values. Quasi-linear
functions (QLFns) are functions composed of any bounded, monotonic function, $E$,
applied to a linear combination of the inputs. (This definition is essentially the same
as that used in [RHM86] and [Wil86a].) A special case of QLFns is the logistic-linear
functions (LLFns), for which $E(x) = 1/(1 + e^{-x})$. The back-propagation algorithm
of [RHW86] is designed to work with LLFns.

A configuration of a network is a set of $n$ functions $F = \{f_1, f_2, \ldots, f_n\}$
corresponding to the set of nodes, $V$, meaning that $f_i$ is the function that node $i$
computes.

2.4  The  Computational  Problem 

In  a  configured  network,  every  node  performs  a  particular  function  and  therefore 
the  network  as  a  whole  performs  a  particular  function  which  is  a  composition  of 
the  node  functions.  An  architecture,  A,  and  a  configuration,  F,  together  define  a 
mapping  from  the  space  of  stimuli  to  the  space  of  responses: 

$\mathcal{M}^A_F : \{0,1\}^s \to \{0,1\}^r.$

The  A  and  F  fully  define  a  circuit  and  thus  fully  define  how  the  network  will  behave 
during  retrieval. 
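The mapping defined by an architecture and a configuration can be computed by visiting the nodes in index order, since every edge runs forward. The following sketch assumes a dictionary-based encoding (node names, a `pre` map of predecessors, and a `config` map from nodes to Python functions); these names and the example network are illustrative only.

```python
def evaluate(inputs, nodes, pre, outputs, config, stimulus):
    """Compute the network's output string for one stimulus.

    inputs   : input post names, in the order of the stimulus bits
    nodes    : node names in index (topological) order
    pre      : dict mapping each node to the tuple of posts feeding it
    outputs  : output post names, in the order of the response bits
    config   : dict mapping each node to its function on a tuple of bits
    stimulus : string over "01", one character per input post
    """
    value = {post: int(bit) for post, bit in zip(inputs, stimulus)}
    for node in nodes:
        value[node] = config[node](tuple(value[p] for p in pre[node]))
    return "".join(str(value[post]) for post in outputs)

# A hypothetical depth-2 network whose configuration computes exclusive-or.
inputs = ["x1", "x2"]
nodes = ["h_or", "h_and", "y"]
pre = {"h_or": ("x1", "x2"), "h_and": ("x1", "x2"), "y": ("h_or", "h_and")}
outputs = ["y"]
config = {
    "h_or": lambda bits: int(any(bits)),
    "h_and": lambda bits: int(all(bits)),
    "y": lambda bits: int(bits[0] == 1 and bits[1] == 0),
}
responses = {s: evaluate(inputs, nodes, pre, outputs, config, s)
             for s in ("00", "01", "10", "11")}
```

Retrieval is cheap: one pass over the nodes suffices. The difficulty studied in this thesis lies in choosing `config`, not in evaluating it.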

A task, as defined above, can be viewed as a collection of constraints on the
mapping that a network is allowed to perform. Recall that an SR item in a task is
a pair of strings $(\sigma, \rho)$. When the posts in $S$ are assigned the values of respective
elements of $\sigma$, the network mapping defines values for each post in $R$. It is required
that these values agree with respective elements of $\rho$. For stimuli not in the task, any
output is acceptable; that is, $\mathcal{M}^A_F$ may be any consistent extension (generalization)
of the task.

The process of loading can now be defined. In the learning problem we are
considering, an architecture and a task are given, and loading is the process of assigning
an appropriate response function to every node in the architecture, $\mathrm{load}(A, T) \to F$,
so that the derived mapping includes the task. It is a procedure that accepts a pair
$(A, T)$ and returns a solution, which is a configuration $F$ such that $T \subseteq \mathcal{M}^A_F$. If no
such configuration exists, the procedure announces that fact.

The loading problem is a search problem, but it is usual to frame complexity
questions in terms of decision problems. In the space of all possible $(A, T)$ pairs,
some pairs will have solution configurations and some will not; that is, for some pairs
the architecture can perform the task, and for some it cannot. The performability
decision problem is simply: “Can the architecture perform the task?” In the style
of [GJ79], this is phrased as follows:

Instance:  An  architecture  A  and  a  task  T. 

Question: Is there a configuration F for A such that
$T \subseteq \{(\sigma, \rho) : \mathcal{M}_F^A(\sigma) \text{ agrees with } \rho\}$?

For  purposes  of  our  ensuing  complexity  questions,  the  decision  problem  embodies 
the  crux  of  the  loading  problem. 

Note that the above statements are technically incomplete because they hold no
direct reference to the node function set being used. Our next (and last) re-phrasing
of the loading problem redresses this oversight, and uses classical terminology for
expressing decision problems: The performability problem is the problem of
recognizing the following (parameterized) language:

$$\mathrm{Perf}_{\mathcal{F}} = \{(A, T) : \exists F \in \mathcal{F}^n \ni T \subseteq \mathcal{M}_F^A\}.$$

The  subscripted  parameter  indicates  the  node  function  set.  We  will  be  asking 
questions  about  a  variety  of  such  sets,  and  each  time  we  change  the  subscript  we 
are  referring  to  a  slightly  different  decision  problem. 
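To make the role of the node function set concrete, here is a minimal brute-force sketch of the decision problem. It is only an illustration under stated assumptions: node functions are encoded as truth-table dictionaries, `function_set` is a hypothetical callable that maps a fan-in to the functions allowed at a node, and the names `load` and `performable` are chosen for this example. The search is exhaustive and takes time exponential in the size of the network, so it says nothing about efficient loading.

```python
from itertools import product

def all_boolean_functions(fan_in):
    """Yield every Boolean function of `fan_in` inputs as a truth-table dict."""
    rows = list(product((0, 1), repeat=fan_in))
    for column in product((0, 1), repeat=len(rows)):
        yield dict(zip(rows, column))

def and_or_only(fan_in):
    """A much smaller node function set: just AND and OR (illustrative)."""
    rows = list(product((0, 1), repeat=fan_in))
    yield {row: int(all(row)) for row in rows}
    yield {row: int(any(row)) for row in rows}

def retrieve(inputs, nodes, outputs, config, stimulus):
    """Evaluate the circuit; `nodes` is a topologically sorted list of (name, fan_in)."""
    values = dict(zip(inputs, stimulus))
    for name, fan_in in nodes:
        values[name] = config[name][tuple(values[src] for src in fan_in)]
    return tuple(values[name] for name in outputs)

def load(inputs, nodes, outputs, task, function_set=all_boolean_functions):
    """Return a configuration that performs `task`, or None if none exists."""
    names = [name for name, _ in nodes]
    choices = [list(function_set(len(fan_in))) for _, fan_in in nodes]
    for combo in product(*choices):          # every assignment of functions to nodes
        config = dict(zip(names, combo))
        if all(retrieve(inputs, nodes, outputs, config, s) == tuple(r)
               for s, r in task):
            return config
    return None

def performable(inputs, nodes, outputs, task, function_set=all_boolean_functions):
    """Decision version: is (architecture, task) in the language for this function set?"""
    return load(inputs, nodes, outputs, task, function_set) is not None

# A one-node architecture and the XOR task.
ONE_NODE = (("x", "y"), [("out", ("x", "y"))], ("out",))
XOR_TASK = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
```

With all Boolean functions allowed, the single node can perform XOR; with only AND and OR it cannot. The same instance therefore falls inside one parameterized language and outside another.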

2.5  Classical  Connectionist  Learning 

The dominant paradigm in current connectionist supervised learning research
follows a style established by the Perceptron many years ago. The following algorithm
illustrates the style using one version of the Perceptron Learning Rule. Let W be
a set of weights and X be an input vector. Let a be one greater than the fan-in to
a node. Then using a simple trick of notation, the threshold is treated as one of
the weights so it does not explicitly appear in the expression of the linear threshold
function.

start:
    choose any arbitrary set of weights W ∈ ℝ^a
test:
    accept an input X ∈ ℝ^a
        and a classification c ∈ {0, 1}
    if W · X > 0 and c = 1
        or W · X ≤ 0 and c = 0 go to test
adjust:
    if W · X > 0 and c = 0 set W ← W − X
    if W · X ≤ 0 and c = 1 set W ← W + X
    go to test

Whenever the weights are adjusted, the retrieval function can change. The
process of adjusting the weights is therefore conceptually the same as choosing a
node function $f \in$ LSFns, as we phrase it.
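The rule can be sketched in runnable form as follows. Several details are assumptions made for this example rather than part of the text: the examples come from a fixed finite list that is cycled through, the loop stops after a full pass with no adjustment (or after `max_passes`), a weighted sum of exactly zero is treated as output 0, and each input already carries a constant final component of 1 so that the last weight plays the role of the threshold.

```python
def dot(w, x):
    """Inner product W . X of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron(examples, weights, max_passes=1000):
    """Run the Perceptron rule over `examples`, a list of (X, c) pairs.

    Returns the final weights and the number of adjustments made.
    """
    w = list(weights)
    adjustments = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, c in examples:
            fires = dot(w, x) > 0
            if fires and c == 0:
                w = [wi - xi for wi, xi in zip(w, x)]   # W <- W - X
            elif not fires and c == 1:
                w = [wi + xi for wi, xi in zip(w, x)]   # W <- W + X
            else:
                continue                                 # output was correct
            clean_pass = False
            adjustments += 1
        if clean_pass:
            break
    return w, adjustments

# Two-input AND, with a constant third component standing in for the threshold.
AND_EXAMPLES = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]
and_weights, and_adjustments = perceptron(AND_EXAMPLES, [0, 0, 0])
```

Each distinct weight vector reached along the way corresponds to a (possibly different) linear threshold function, so the run can be read as a walk through candidate node functions.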

As  exemplified  above,  the  learning  process  in  the  classical  paradigm  is  a  cyclic 
repetition  of  the  following  steps: 

1.  A  stimulus  is  received  from  the  environment. 

2.  An  output  is  calculated  by  the  retrieval  process. 

3.  Some  form  of  information  about  the  correctness  of  the  output  is  given. 

4.  A  determination  is  made  about  how  a  change  in  each  weight  would  individually 
affect  the  overall  performance  of  the  network. 

5.  All  the  weights  are  changed  according  to  what  step  (4)  would  determine  to 
be  an  improvement. 

The determination in step (4) is made based on information available locally at the
node in question, which in the case of the Perceptron was only the value of the
input relevant to the weight in question and whether or not the most recent output
was correct. Note that as a consequence, each determination regarding a weight
is independent of every other determination, and the amount of information they
mutually have access to is minimal.

In  step  (3)  above,  some  form  of  information  about  the  correctness  of  the  output 
is  given.  This  can  come  in  various  forms,  but  the  most  direct  form  is  simply  to  be 
given  the  correct  answer.  When  such  is  the  case,  the  protocol  is  called  supervised 
learning.  Variants  on  this  protocol  include  reinforcement  learning  in  which  the  only 
information  given  is  a  scalar  evaluation  of  how  ‘good’  the  output  from  step  (2) 
was.  Both  these  schemes  can  be  complicated  by  introducing  noise  into  the  data. 
In  such  a  case,  the  information  supplied  in  step  (3)  has  only  a  certain  probability 
of  being  correct,  so  the  system  is  faced  with  a  problem  in  stochastic  optimization. 
The literature on Learning Automata has studied this problem [NT74,NL77,TR81]
but we will avoid it.

Most  connectionist  research  has  attempted  to  comply  with  some  present-day 
notions  of  neurological  plausibility.  This  tradition  is  very  much  attached  to  linear 
sum  node  functions  (primarily  LSFns  and  LLFns)  but  the  major  characteristic  of  the 
style  has  to  do  with  the  way  in  which  the  algorithm  interacts  with  the  environment 
and  with  its  internal  state  variables.  A  so-called  ‘neural’  algorithm  has  a  style 
substantially akin to the Perceptron’s in the following senses: the loading component
operates  with  minimal  information  beyond  what  the  retrieval  component  uses;  each 
node  acts  independently  and  somewhat  simultaneously  in  adjusting  its  weights;  and 
every  node  relies  on  information  locally  available  only,  where  locality  is  defined  as 
per  connectivity  in  the  net.  As  is  true  for  the  Perceptron,  a  connectionist  system 
often  has  no  state  variables  except  its  weights,  and  the  meaning  of  these  weights  is 
fully  defined  by  the  retrieval  algorithm.  There  is  no  significance  to  these  variables 
beyond  what  the  retrieval  algorithm  gives  them. 

At this point in time it is difficult to formulate exactly what would be acceptable
as ‘neural’, but a loose formal model of this idea would at least specify a constant
number  of  bits  of  memory  associated  with  each  edge  of  the  architecture  and  a 
constant  number  of  bits  associated  with  each  node  (independent  of  network  size). 

Any scheme adhering to the general 5-point procedure outlined above can be
treated  as  an  ‘on-line’  algorithm.  An  on-line  system  is  one  that  guesses  a  response 
to  each  given  stimulus  before  it  is  told  the  required  response  and  is  held  to  account 
for  these  guesses.  One  could  view  this  as  a  model  of  an  adaptive  system  having  to 
make  tactical  decisions  in  an  ongoing  environment.  Of  course,  any  such  system  will 
sometimes  make  mistakes  in  its  guesses;  it  is  interesting  to  find  upper  and  lower 
bounds  on  the  number  of  mistakes  an  on-line  system  will  make. 
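The on-line protocol can be sketched as a small harness that counts mistakes. The names `count_mistakes` and `RoteLearner` are hypothetical, and the rote learner is chosen only because its behaviour is easy to follow: it guesses a default response for unseen stimuli and otherwise repeats the last required response it was told.

```python
def count_mistakes(predict, update, stream):
    """Run the on-line protocol over `stream`, a sequence of (stimulus, required) pairs.

    The learner must guess before the required response is revealed.
    """
    mistakes = 0
    for stimulus, required in stream:
        guess = predict(stimulus)        # guess first
        if guess != required:
            mistakes += 1                # held to account for the guess
        update(stimulus, required)       # then learn the required response
    return mistakes

class RoteLearner:
    """Remembers the required response for every stimulus seen so far."""
    def __init__(self, default=0):
        self.memory = {}
        self.default = default

    def predict(self, stimulus):
        return self.memory.get(stimulus, self.default)

    def update(self, stimulus, required):
        self.memory[stimulus] = required

learner = RoteLearner()
stream = [("a", 1), ("a", 1), ("b", 0), ("a", 1)]
rote_mistakes = count_mistakes(learner.predict, learner.update, stream)
```

For a noise-free task, this learner makes at most one mistake per distinct stimulus, which is a simple example of an upper bound on mistakes.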

2.6  Discussion 

There  are  three  ways  in  which  this  formulation  seems  to  stray  from  the  connectionist 
paradigm.  First,  there  is  nothing  in  our  model  of  learning  that  reflects  any  of  the 
neural  desiderata.  However,  this  means  that  any  intractability  result  will  therefore 
be  conservative.  We  have  chosen  not  to  adhere  to  these  neural  constraints  in  order 
to  strengthen  our  results. 

Second, connectionists are apt to find the performability problem a strange
formulation because from their point of view the architecture is not an input to the
problem but rather a specification of the machine that is to solve the problem. The
reason  for  this  formal  rearrangement  lies  in  the  research  strategy  of  the  connectionist 
community.  The  prevailing  goal  has  been  to  find  a  ‘learning  rule’  that  can  be 
employed  in  each  node  of  a  network  to  pass  information  back  and  forth,  to  witness 
the  various  task  inputs  and  errors  made  during  early  operation,  and  to  eventually 
settle  on  a  node  function  (i.e.  a  set  of  weights)  that  will  eliminate  output  errors. 
The  point  to  note  is  that  this  search  for  learning  rules  has  historically  been  a  search 
for a universal rule: one that could afford to be oblivious to the type of architecture
into which the rule will be deposited to do its work. It has been implicitly hoped
that the architecture has little to do with the difficulty of loading. Of course it
was recognized that the architecture has a great influence over what mappings can
be  performed,  but  after  assuming  that  the  network  was  adequate  to  perform  a 
given  task,  it  was  hoped  that  a  general  purpose  (i.e.  a  non-architecture-specific) 
learning  rule  would  be  able  to  configure  the  weights  correctly.  Therefore  we  can 
freely  vary  the  architecture  when  formulating  the  computational  problem  faced  by 
such  a  learning  rule;  i.e.  we  can  make  it  an  input  parameter. 

Third, connectionists might again object to this formulation because of the lack
of  any  mention  of  the  architecture  in  the  model  of  computation  that  will  be  used 
to  actually  find  the  configuration.  By  omitting  any  reference  to  the  machine,  the 
phraseology  above  implicitly  poses  the  problem  in  terms  of  a  Turing  machine,  or 
at  least  in  terms  of  some  standard  serial  model  of  computation.  The  connectionist 
approach  is  to  run  the  learning  rule  in  all  nodes  of  the  network  simultaneously,  so 
one  wonders  if  this  parallel  model  might  not  be  more  powerful.  Perhaps  so,  but 
note  that  the  number  of  nodes  is  at  most  linear  in  the  problem  size  (since  the  input 
includes  a  specification  of  the  architecture),  so  the  speed-up  due  to  parallelism  can 
be no more than linear. In the face of the NP-completeness result to follow in
Chapter  4,  this  is  inconsequential. 
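The linear bound can be spelled out as a short inequality. The symbols here are introduced only for this note: $N$ is the problem size, $n \le N$ the number of nodes, $t_{\mathrm{par}}(N)$ the running time with one processor per node, and $t_{\mathrm{serial}}(N)$ the running time of the best serial algorithm. The argument assumes that a serial machine can simulate one parallel step of $n$ processors in about $n$ steps.

```latex
t_{\mathrm{serial}}(N) \;\le\; n \cdot t_{\mathrm{par}}(N) \;\le\; N \cdot t_{\mathrm{par}}(N)
\quad\Longrightarrow\quad
t_{\mathrm{par}}(N) \;\ge\; \frac{t_{\mathrm{serial}}(N)}{N}.
```

If $t_{\mathrm{serial}}$ grows faster than every polynomial, dividing by $N$ leaves a quantity that still grows faster than every polynomial, so this form of parallelism cannot turn an infeasible problem into a feasible one.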

A word about our use of the word “loading”. The connectionist literature rather
uniformly  uses  the  word  “learning”;  why  should  we  use  another  word?  Firstly,  the 
word  “learning”  is  used  in  AI  and  other  fields  to  refer  to  a  great  variety  of  different 
things  and  it  is  useful  to  distinguish  some  of  these  uses  from  others.  Secondly, 
although  the  connectionist  literature  is  fairly  consistent  about  what  their  learning 
problem  is,  our  loading  problem  is  not  exactly  the  same  as  theirs  either,  so  it 
behooves  us  to  be  precise  by  using  a  different  name  for  a  different  problem. 

Specifically, the loading problem involves:

1. a given (previously unknown) network,

2. total, easy, ongoing access to the network structure,

3. a given (previously unknown) task, and

4. total, easy, ongoing access to all items in the task,

where by ‘total’ we mean freedom from locality constraints; by ‘easy’ we mean
O(|A|) cost to read the whole data; and by ‘ongoing’ we mean there is no limit to
the number of accesses allowed.

Of the four aspects listed here, (3) is certainly true for classical connectionist
learning, and (1) is often true although implicit. Both (2) and (4) are usually
not  part  of  classical  connectionist  learning.  Strangely,  although  (3)  is  seemingly 
paramount,  it  may  be  the  least  important  aspect  of  the  model  in  the  sense  that 
knowing  the  task  in  advance  might  not  make  any  difference  to  the  computational 
complexity  of  the  problem. 

Note  that  (4)  implies  noise-free  supervised  learning — the  input  data  are  always 
dependable  in  that  items  are  always  consistent.  It  also  implies  knowledge  of  the 
exact  number  of  items,  something  that  classical  models  do  not  have  access  to. 
Aspect  (2)  indirectly  implies  that  the  architecture  is  fixed  and  cannot  be  altered 
during  loading.  Both  (2)  and  (4)  more  or  less  imply  that  a  Turing  machine  will  be 
employed  to  perform  the  loading  function;  the  algorithm  is  not  required  to  run  on 
a  distributed  machine. 

The loading problem, then, is our formalization of a particular computational
problem which is closely akin to classical connectionist learning but is altered
slightly to be on the easy side of three major issues:

• the type of machine used to solve it,

• the style of processing required, and

• the type of information available.

It should be noted that when NP-completeness is found with this model, it is
especially germane because we have focussed on the easiest and least restrictive
conditions for all three of these issues. By applying automatically to many of the more
difficult cases, the theory is particularly relevant.

When we find loading to be difficult, we will know that the classical connectionist
learning problem must be at least as difficult; when we find loading to be easy, we
will only have suggestive evidence that the classical connectionist learning problem
is easy.

Chapter 3

REVIEW  OF  RELATED  WORK 


Some background in other formal learning theories will help put our work in
perspective. It will also help explain why our theory bears on connectionist design
problems, whereas the others do not.

There  is  a  long  tradition  of  research  on  the  problem  of  inferring  a  general  rule 
to describe a set of specific examples. Philosophers [Bac42,Car50], then
cyberneticists, cognitive psychologists [HMS66], engineers, and more recently AI researchers
[Mit77,DM81]  have  all  considered  the  problem.  In  a  distilled  form,  the  quest  is  to 
find  a  procedure  that  can  take  objects  representing  positive  and  negative  examples 
of  a  concept  and  find  an  expression  (in  some  form  to  be  discussed)  that  expresses 
whether  a  given  object  is  a  positive  or  negative  example  of  the  underlying  concept. 
This type of process is a major component of what is colloquially called learning.
Such inference procedures could be used for classifying unseen examples, for
predicting future events, for storing data in a compressed format, or just for storing
data in a convenient format. The last two purposes might be valid even when all
instances of the concept have already been witnessed. Our primary motivation is
similar to the last one: to store data in a particular format.

Some mathematical models of learning from examples have recently been
developed. We will review the relevant aspects of the learning formalisms defined by
Gold and by Valiant, and then contrast our formalism with theirs.

$\varphi$ is a learning procedure.
$t$ is a text. A text is an enumeration of an r.e. set, which is
    equivalent to strings in a language. Every r.e. set is equal to
    the domain of some function $\varphi_i$.
$W_i$ is the domain of $\varphi_i$.
$t_j$ is the first $j$ elements of $t$.
$L$ is a language.
$\mathcal{L}$ is a class of languages.

$\varphi$ converges on $t$ to $i$ $\iff$
    (a) $\varphi$ is defined on $t$.
    (b) $\exists n \ni \varphi(t_j) = \varphi(t_n) \;\forall j > n$.

$\varphi$ identifies $t$ $\iff$
    (a) $\varphi$ converges on $t$ to some $i$.
    (b) $\mathrm{rng}(t) = W_i$.

$\varphi$ identifies $L$ $\iff$ $\varphi$ identifies all texts for $L$.
$\varphi$ identifies $\mathcal{L}$ $\iff$ $\varphi$ identifies every $L \in \mathcal{L}$.
$\mathcal{L}$ is identifiable $\iff$ some $\varphi$ identifies $\mathcal{L}$.

Figure 3.1: Gold’s definition of learnable (identifiable)


3.1  Gold 

Gold [Gol67] established a field of learning theory in 1967 which he labelled
‘identification in the limit’. He asked whether there is a procedure which could read in an
endless  sequence  of  example  strings  in  a  language  and  eventually  find  a  grammar 
for  the  language.  See  Figure  3.1  for  a  formal  definition. 
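A small sketch may help fix the idea of convergence on a text. It uses the class of finite languages, for which the learner that conjectures "exactly the strings seen so far" is a standard textbook example of identification in the limit. Two simplifications are assumptions of this example: a conjecture is represented directly as a set of strings rather than as a grammar index, and only a finite prefix of the (in principle endless) text is examined.

```python
def learner(prefix):
    """Conjecture after reading a finite prefix of a text: the strings seen so far."""
    return frozenset(prefix)

def conjectures(text_prefix):
    """Return the learner's conjecture after each element of the prefix."""
    return [learner(text_prefix[:j]) for j in range(1, len(text_prefix) + 1)]

# A finite language and the start of one text for it (repetitions are allowed).
LANGUAGE = frozenset({"a", "ab", "abb"})
TEXT_PREFIX = ["a", "ab", "a", "abb", "ab", "a", "abb"]

guesses = conjectures(TEXT_PREFIX)
# Index of the first conjecture that equals the language.
stable_from = next(j for j, g in enumerate(guesses) if g == LANGUAGE)
```

Once every string of the language has appeared, the conjecture never changes again, no matter how the text continues. Nothing in the definition bounds how long that takes, which is the sense in which the theory resembles computability theory.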

Many other researchers [BB75,Cho80,OSW86,WC80,AS83,Sha81] have built up
the theory. The questions concern infinite languages, and therefore they involve
infinite ‘texts’, meaning that the system must see unbounded amounts of data. In
the main, the questions asked place no bounds on time or space required for learning.
The flavour is very much like computability theory as opposed to complexity theory.

$F$ is a class of programs (concepts).
$p$ is a polynomial.
$A$ is an algorithm.
$\epsilon$, $\delta$ are probabilities.
$n$ is a positive integer.
$f$ and $g$ are programs.
$D^+$ is a probability distribution of positive examples.
$D^-$ is a probability distribution of negative examples.

“$F$ is learnable” $\iff$
    $(\exists p, A)$ such that
    $(\forall n)(\forall f \in F_n)(\forall D^+, D^-)(\forall \epsilon, \delta > 0)$
    $A$ halts in time $p(n, \mathrm{size}(f), 1/\epsilon, 1/\delta)$
    with output $g \in F_n$ that
    with probability $> 1 - \delta$
    has property $\sum_{g(x)=0} D^+(x) < \epsilon$
    and property $\sum_{g(x)=1} D^-(x) < \epsilon$

Figure 3.2: Valiant’s definition of learnable


3.2  Valiant 

Valiant [Val84] established a lower-level field of learning which has subsequently
been  elaborated  by  himself  and  others  [Val85,PV86,KLPV87].  His  paradigm  is 
concerned  not  just  with  what  is  learnable  but  with  what  is  feasibly  learnable  (see 
Figure  3.2).  The  definition  of  feasibility  relies  on  the  well-honoured  distinction 
between  ‘polynomial’  problems  and  ‘super-polynomial’  (or  NP-hard)  problems. 

Valiant’s definition concerns data represented by a fixed (finite) number of
variables which typically are all binary-valued. The learning system must discover the
underlying rule that describes whether such a given bit string is or is not an
example of a ‘concept’. The learner views example after example and tries to deduce a
description of the concept, so it is similar to Gold’s paradigm in this regard, but it
is different from Gold’s in at least three other regards:


1.  fixed-length  bit  strings,  ergo  finite  bodies  of  data  to  purview; 

2. bounded time to accomplish the learning, specifically time bounded by a
polynomial in various parameters of the problem;

3.  specific  guidelines  as  to  the  form  of  the  concept  description. 

The  third  difference  is  the  most  fundamental  to  the  formulation  and  is  the  most 
germane  to  our  discussion.  Basically,  Valiant’s  theory  is  intended  to  determine 
whether  concepts  of  a  certain  class  are  easy  to  learn.  For  instance,  if  a  concept  can 
be  expressed  in  conjunctive  normal  form  with  at  most  4  variables  per  disjunct,  is 
it  possible  to  deduce  that  expression  from  seeing  examples  alone?  If  a  concept  can 
be  expressed  as  a  disjunct  of  two  conjuncts,  is  it  possible  to  deduce  that  expression 
from  seeing  examples  alone? 
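For the first of these questions, a well-known elimination strategy gives a feel for what "deducing the expression from examples" can look like: start with every clause of at most k literals and delete each clause that some positive example falsifies. The sketch below is illustrative only. It omits the probabilistic sampling that the formal definition requires, and the clause encoding and function names are assumptions of this example.

```python
from itertools import combinations, product

def all_clauses(n, k):
    """Every disjunction of 1..k literals over n variables.

    A literal is (index, wanted_value); a clause is a tuple of literals.
    """
    clauses = []
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for wanted in product((0, 1), repeat=size):
                clauses.append(tuple(zip(idxs, wanted)))
    return clauses

def satisfies(clause, x):
    """True if bit string x makes at least one literal of the clause true."""
    return any(x[i] == v for i, v in clause)

def learn_kcnf(positives, n, k):
    """Keep exactly the clauses that no positive example falsifies."""
    clauses = all_clauses(n, k)
    for x in positives:
        clauses = [c for c in clauses if satisfies(c, x)]
    return clauses

def predict(clauses, x):
    """The hypothesis is the conjunction of all surviving clauses."""
    return int(all(satisfies(c, x) for c in clauses))

# Illustrative target concept: x0 AND (x1 OR x2), a 2-CNF over 3 variables.
def target(x):
    return int(x[0] == 1 and (x[1] == 1 or x[2] == 1))

POSITIVES = [x for x in product((0, 1), repeat=3) if target(x)]
hypothesis = learn_kcnf(POSITIVES, n=3, k=2)
```

The number of candidate clauses is polynomial in n for fixed k, which is what keeps this procedure feasible; the form of the concept description is what makes the difference.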

His  definition  of  learnability  has  some  important  other  subtleties  which  capture 
probabilistic  aspects  of  generalization.  These  are  discussed  in  Section  7,  but  are 
not  relevant  to  the  present  purposes. 

3.3  Our  Model 

The  present  work  describes  a  third  field  of  learning  theory  which  might  be  viewed 
as the lowest level of the three. It is inspired by the computational problem
underlying the connectionist approach to learning. Whereas Valiant differed from Gold
primarily  on  the  issue  of  time,  Valiant  and  this  work  differ  primarily  on  the  concern 
for the circuit involved in representing the data. Our paradigm is concerned not
just with what is feasibly learnable, but with what is feasibly learnable in a machine
with a certain fixed structure. It shares the same similarities and differences with
Gold as Valiant, but its “specific guidelines as to the form” of the representation
(item 3 above) are even more strict than are Valiant’s. See Figure 3.3. It requires
that a representation for the learned data be found that can be embodied in a
specific network structure. To achieve it, details of the function at each point in the
net are alterable, but no alterations to the connectivity of the network are allowed.

$\mathcal{A}$ is a design class of architectures (networks).
$A$ is an architecture (network).
$p$ is a polynomial.
$B$ is an algorithm.
$F$, $G$ are configurations for $A$ (i.e. settings for all the adjustable
    variables in all nodes of $A$).
$\mathcal{M}_F^A$ is the behaviour of $A$ when configured with $F$.
$T \subseteq \{(\sigma, \rho) \mid \mathcal{M}_F^A(\sigma) = \rho\}$ is a task.

“$\mathcal{A}$ is loadable” $\iff$
    $(\exists p, B)$ such that
    $(\forall A \in \mathcal{A})(\forall F \text{ for } A)(\forall T \subseteq \mathcal{M}_F^A)$
    $B$ halts in time $p(|A| + |T|)$ with
    output $G$ such that $T \subseteq \mathcal{M}_G^A$

Figure 3.3: Our definition of learnable (loadable)

For example, if the given network were this:

[Network diagram: inputs x, y, and z; node a reads x and y, node b reads y and z,
and the output node c reads x, a, and b.]

and the data to be learned were values for f, x, y, and z such that f were some
function of x, y, and z, then the objective of the learning system would be not
only to discover the function f(x, y, z) but also to find 3 more functions a, b, and c
such that f(x, y, z) = c(x, a(x, y), b(y, z)). This is more constrictive than Valiant’s
formulation because Valiant places only general grammatical guidelines on the form
of f where we have an exact expression, minus only the specifications of a, b, and c.
This prior knowledge of the form does not make the general learning paradigm any
easier or harder, but merely different. It asks a question about whether a particular
network can be made to represent some data, not whether it is possible to find
some network to represent those data.
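A quick count shows how large the space of candidate "fillings" is even for this three-node example. The count assumes, purely for illustration, that each node may compute any Boolean function of its inputs; a node with fan-in $k$ then has $2^{2^k}$ possible functions. Nodes $a$ and $b$ have fan-in 2 and node $c$ has fan-in 3:

```latex
\underbrace{2^{2^2}}_{a} \cdot \underbrace{2^{2^2}}_{b} \cdot \underbrace{2^{2^3}}_{c}
  \;=\; 16 \cdot 16 \cdot 256 \;=\; 2^{16} \;=\; 65\,536 .
```

In general, a network whose nodes have fan-ins $k_1, \dots, k_n$ admits $\prod_{i=1}^{n} 2^{2^{k_i}}$ configurations under this assumption, so exhaustive search over configurations grows exponentially with the number of nodes.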

3.4  Comparison  Summary 

This  section  summarizes  the  similarities  and  differences  between  the  three  learning 
paradigms  being  considered. 

3.4.1  Requirements 

In  broad  terms,  each  formulation  is  phrased  as  “For  each  member  of  problem  class 
X,  and  for  each  of  many  different  ways  of  presenting  the  learning  data,  the  class  is 
said  to  be  learnable  if  there  is  a  dependable  way  to  remember  the  data.”  The  first 
two  clauses  in  this  sentence  correspond  to  different  formalisms  in  each  paradigm: 


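paradigm    For each member of the class               For each presentation of the data
---------   ----------------------------------------   ---------------------------------
Gold        $(\forall L \in \mathcal{L})$              ($\forall$ texts for $L$)
Valiant     $(\forall f \in F)$                        $(\forall D^+, D^-)$
this work   $(\forall A \in \mathcal{A})$              ($\forall$ permutations of $T$)
            ($\forall$ configurations $F$ for $A$)
            $(\forall T \subseteq \mathcal{M}_F^A)$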

3.4.2  Motivation 

Gold’s study of Comparative Grammars is an attempt to characterize the class
of natural languages through formal specification of their grammars. His original
motivation for setting up a formal learning theory was not to study learning for
its own sake but to develop a tool to understand natural languages. In attempting to
define what a natural language is, he looked for constraints on its form provided by
the observable fact that 2-year-olds can learn it. Furthermore, they learn it mostly
by listening to others speak it. Thus formal learning theory was originally conceived
to assist the comparative study of grammars but it ultimately might contribute to
theories of psychology or neural architecture.

Valiant’s  motivation  can  be  viewed  as  an  attempt  to  develop  a  foundation  for 
learning  in  AI.  He  wants  to  discover  good  models  relevant  to  building  devices  that 
can  learn,  and  to  find  the  limits  to  what  can  be  feasibly  learned.  True  to  traditions 
of AI, he uses an abstract Turing machine as the model of computation. Hence
his  paradigm  is  not  concerned  with  any  structural  or  functional  constraints  on  the 
algorithms;  it  merely  asks  that  an  algorithm  complete  its  task  within  a  certain 
amount  of  time.  Valiant’s  formulation  is  exactly  relevant  to  AI,  which  shares  these 
same  freedoms  and  constraints. 

Our motivation arises from the search to understand a very particular
computational model. Like Gold, we use it not directly for its own sake but as a tool to
constrain an ulterior theory. In seeking to understand connectionist computation,
we seek constraints on network design provided by the computational feasibility of
learning. 

3.4.3  Quantitative  comparison 

The  various  quantities  involved  in  the  three  formulations  are  compared  in  this  table: 


            variables   input size   output size   time      data structure
---------   ---------   ----------   -----------   -------   --------------
Gold        infinite    infinite     finite        finite    any
Valiant     fixed       bounded      bounded       bounded   constrained
this work   fixed       bounded      fixed         bounded   fixed

3.4.4 ‘Grammatical Focus’

Gold studies the relation between finite evidence and infinite languages. This
necessarily involves a grammar, but the form of the grammar is not explicitly mentioned.
For  example,  he  asks  whether  it  is  possible  to  find  a  description  (i.e.  a  grammar) 
for  any  given  recursively  enumerable  set. 

We  can  cast  the  other  2  models  in  terms  of  grammars  and  languages  as  well.  The 
data  structure  resulting  from  Valiant-type  learning  can  be  seen  as  a  ‘sentence’  in  a 
language  described  by  grammatical  syntax  rules,  where  neither  the  words  nor  their 
interrelationships  are  known  a  priori  but  where  the  grammar  serves  as  a  validity 
test  for  the  sentence.  Valiant  studies  the  relation  between  such  a  grammar  for 
representation  and  the  complexity  of  finding  an  appropriate  sentence  complying 
with  that  grammar.  For  example,  he  asks  how  hard  it  is  to  find  a  3-CNF  expression 
for  a  given  body  of  data.  The  grammar  involved  is  ‘3-CNF-ness’,  and  the  sentence 
sought  would  have  to  be  a  3-CNF  expression. 

We  study  the  relation  between  a  grammar  and  the  complexity  of  finding  words 
to  fill  a  specific  sentence  structure  from  that  grammar.  Our  learned  data  structure 
is  more  particular  than  Valiant’s  in  that  it  is  not  any  sentence  from  some  grammar, 
but  is  a  particular  sentence  from  it.  For  example,  we  ask  how  hard  it  is  to  load  a 
network drawn from the family of two-layered networks. Two-layeredness would
be the ‘grammar’. The specific ‘sentence’ involved could be the example network
used in Section 3.3, and using that example the ‘words’ sought would be specifications
of the functions a, b, and c. The position and relationships of each ‘word’ are fully
specified  in  advance,  and  the  learning  system  need  only  discover  what  the  missing 
words are.

3.4.5 Environment


One  of  Gold’s  original  definitions  is  of  an  informant ,  which  is  a  particular  kind  of 
‘environment’,  or  protocol  for  interaction  between  a  learning  system  and  a  source 
of  data.  An  informant  is  an  environment  in  which  strings  are  presented  serially  to  a 
machine  paired  with  an  indication  of  whether  that  string  is  in  the  target  language 
or  not. 

Like  Gold,  Valiant  explores  a  variety  of  environments,  but  one  environment 
is  quite  similar  to  an  informant.  His  terminology  for  it  is  ‘positive  and  negative 
examples’  of  a  concept. 

Our  protocol  for  gathering  information  is  also  quite  similar  to  Gold’s  informant. 
The  learner  is  presented  with  pairs  of  strings  called  stimulus  and  response.  The 
object  is  to  remember  what  response  is  appropriate  for  each  stimulus.  If  the  response 
string were only one bit long, it would be equivalent to saying IN/OUT (à la Gold)
or POS/NEG (à la Valiant). The response string is a useful generalization of the
one-bit  notion  but  is  not  a  conceptual  deviation  from  the  basic  idea. 

Hence  we  consider  the  three  paradigms  as  having  nearly  equivalent  learning 
environments.  In  fact,  it  is  this  commonality  of  supervised  learning  that  makes  the 
comparisons  meaningful. 

3.5  Studies  in  Connectionist  Learning 

Many  researchers  have  developed  algorithms  for  supervised  learning  in  connectionist 
networks. A good review is given by Hinton [Hin87]. Some of the approaches
most relevant to our study are the Perceptron [Ros61,MP72], linear associators
[And72,Koh77,Koh84], back-propagation [RHW86,Par85,lC85], and the associative
reward-penalty ($A_{R-P}$) scheme [BA85,Bar85]. All of these are ‘neural’ algorithms
for feed-forward networks.

A neural algorithm has been given for Boltzmann machines [AHS85,HS86], but
the Boltzmann machine is a recurrent network. Hopfield [Hop82] gives a non-neural
method, also for training recurrent (and thus dynamic) networks. Our work does
not address this style of retrieval mechanism. Unsupervised learning paradigms
have also been studied (see [RZ85] and references therein). The present work
does not speak directly to this paradigm either, so none of it will be reviewed here.

Analyses of the feed-forward models have been mostly for a single linear
threshold unit or for a 2-layered machine where only one layer is trainable (Perceptron).
A  layered  machine  is  one  where  the  nodes  are  divided  into  disjoint  sets  called  layers, 
network  inputs  are  connected  only  to  the  first  layer,  and  subsequent  layers  get  their 
input  signals  only  from  a  previous  layer.  There  have  also  been  some  investigations  of 
more  general  structures,  which  we  will  review  after  considering  the  work  on  simple 
networks. 

3.5.1  Simple  Networks 

In  the  ‘one-layer’  case,  learnability  results  span  a  great  range.  Some  problems  are 
impossible  to  solve;  some  can  be  solved  ‘in  the  limit’,  i.e.  by  using  infinite  time;  some 
have  time  bounds  that  are  known  only  to  be  finite;  some  have  exponential  time; 
some  polynomial;  and  some  logarithmic.  The  scaling  arguments  are  with  respect  to 
s,  the  number  of  bits  in  the  input  vector/string.  They  are  considered  here  in  this 
same  order. 

Impossible: There are not nearly as many linearly-separable functions as there
are general Boolean functions on $\{0,1\}^s$, so most Boolean functions on a large
number of variables cannot be performed (or perforce, learned) by a single-node
linear threshold unit. In their book Perceptrons: An Introduction to Computational
Geometry  [MP72],  Minsky  and  Papert  answered  questions  regarding  the  functional 
powers of the 2-layer model and characterized classes of functions that could not
be  performed  when  both  layers  have  bounded  fan-in.  Of  course,  any  function  can 
be  performed  with  an  exponential  fan-in,  but  this  is  clearly  impractical. 
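
The counting argument can be checked by hand for very small s. The sketch below is only an illustration: it assumes that searching weights in {-1, 0, 1} with half-integer thresholds is enough to find every linearly separable function of two variables (true for s = 2, not for large s), and the function name `threshold_functions` is ours.

```python
from itertools import product

def threshold_functions(s, weights=(-1, 0, 1)):
    """Truth tables on {0,1}^s computable by one linear threshold unit.

    Assumption: only small integer weights and half-integer thresholds are
    searched, which is enough for s = 2 but not for large s.
    """
    inputs = list(product((0, 1), repeat=s))
    bound = s * max(abs(w) for w in weights)
    thresholds = [k + 0.5 for k in range(-bound - 1, bound + 1)]
    tables = set()
    for w in product(weights, repeat=s):
        for theta in thresholds:
            table = tuple(
                int(sum(wi * xi for wi, xi in zip(w, x)) > theta) for x in inputs
            )
            tables.add(table)
    return tables

separable = threshold_functions(2)
xor = (0, 1, 1, 0)  # outputs for inputs 00, 01, 10, 11
print(len(separable), "of", 2 ** (2 ** 2), "functions are linearly separable")
print("XOR separable?", xor in separable)
```

Already at s = 2 two of the sixteen Boolean functions (XOR and its complement) are missing, and the missing fraction grows rapidly with s.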

Infinite:  Several  asymptotic  results  have  been  given  for  stochastic  approximation 
methods  [SW81,DH73],  for  stochastic  Learning  Automata  [NT74],  and  for  a  combi¬ 
nation  of  these  [BA85].  For  instance,  when  placed  in  a  stochastic  setting,  and  mod¬ 
ified  by  gradually  reducing  the  adjustment  constant,  the  classic  Widrow-Hoff  rule 
[WH60] has been shown to converge asymptotically to the solution of least squared
error  with  probability  1.  Another  convergence  theorem  was  given  by  Barto  and 
Anandan  [BA85]  for  a  difficult  reinforcement  training  protocol  that  involves  noisy 
data  and  an  impoverished  form  of  feedback.  They  prove  in  a  restricted  case  that 
the stochastic $A_{R-P}$ procedure in one node will almost surely converge to correct
responses.  But  these  convergence  theorems  are  only  for  asymptotic  performance, 
which  means  the  time  upper  bound  is  infinite. 

Finite:  Rosenblatt  [Ros61]  and  others  proved  a  theorem  stating  that  the  various 
Perceptron  learning  rules  will  eventually  converge  to  correct  weights  if  such  weights 
do  exist.  See  Nilsson  [Nil65]  for  notes  on  the  history  of  its  various  proofs.  This 
development  demonstrated  that  the  Perceptron  would  learn  in  finite  time,  even 
though  it  was  a  very  simple  and  ‘neural’  device. 
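
For concreteness, here is a minimal sketch of one common form of the Perceptron learning rule. It is not taken from [Ros61] or [Nil65]; the integer weights, the zero initialization, the bias term and the epoch cap `max_epochs` are all assumptions made for the illustration.

```python
def train_perceptron(samples, max_epochs=100):
    """Fixed-increment Perceptron rule on (input tuple, 0/1 target) pairs.

    Returns (weights, bias, number of updates) once a full pass makes no
    mistakes, or None if that has not happened within max_epochs passes.
    """
    s = len(samples[0][0])
    w = [0] * s
    bias = 0
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            out = int(sum(wi * xi for wi, xi in zip(w, x)) + bias > 0)
            if out != target:
                delta = target - out          # +1 or -1
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                bias += delta
                mistakes += 1
        updates += mistakes
        if mistakes == 0:
            return w, bias, updates
    return None

and_task = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_task = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print("AND:", train_perceptron(and_task))
print("XOR:", train_perceptron(xor_task))
```

On the linearly separable AND task the loop stops after finitely many updates, as the convergence theorem promises; on XOR no correct weights exist, so the sketch gives up and returns None.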

Exponential: Muroga [Mur65] showed that there are linearly separable functions
whose weights are approximately as large as $2^s$. Thus even when the function is
performable, it will take the various Perceptron learning rules $\Omega(2^s)$ adjustments
before getting acceptable weights. Hampson and Volper [HV86] extended the argument
to the average case (as opposed to the worst case) and derived a bound of
$O(1.4^s)$.

Tesauro [Tes87] measured learning times as a function of the size of the task. He
used 3 networks of a particular style, one particular algorithm (back-propagation),
and one particular function from which he draws t random items to make up a task.
He then plotted learning time as a function of t, and found it to be the sum of a
polynomial and an exponential. The polynomial dominated in the low ranges but
after a certain point, the exponential dominated.

Polynomial: Hampson and Volper [HV86,VH86,HV87] explore several algorithms
and learning situations for the single Perceptron to see how they behave as the number
of input dimensions, s, is scaled up. They report exponential times for all but
a  few  simple  cases.  When  the  additional  dimensions  are  irrelevant  or  redundant,  or 
when  the  task  being  learned  is  an  OR  or  AND,  then  low  polynomials  in  s  are  found. 

Logarithmic:  Littlestone  [Lit87]  found  polynomial  on-line  mistake-bounds  for  a 
variety  of  classes  of  functions.  He  considers  a  node  function  set  with  the  same  form 
as  linear  threshold  functions  but  he  demands  a  minimum  amount  of  separability 
between  the  different  classes.  (This  restriction  is  a  very  appealing  refinement  to 
the  model  of  a  ‘neural’  node  function  set,  since  it  allows  the  separating  plane  to 
be  placed  anywhere  within  a  range  and  thereby  relaxes  the  unrealistic  requirement 
for  arbitrary  precision  in  the  weights.)  For  the  case  where  the  target  function  is 
a  simple  disjunction  of  some  subset  of  the  input  bits,  he  gives  an  algorithm  that 
makes $O(k \log s)$ mistakes, k being the size of the relevant subset. When learning k-DNF
expressions (for some fixed k), his algorithm has an upper bound of $O(kl \log s)$
mistakes. (l is the length of the expression learned, and s essentially measures the
number  of  irrelevant  input  bits.)  This  is  remarkable  both  for  being  linear  in  k  and 
for  being  logarithmic  in  s. 
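
The flavour of such a mistake-bounded learner can be sketched as follows. This is our illustration of a multiplicative-update rule with elimination for monotone disjunctions, assumed to be of the kind analysed in [Lit87]; the parameter choices (alpha = 2, threshold equal to s), the random example stream and the helper name `winnow` are assumptions, not details from the text.

```python
import math
import random

def winnow(examples, s, alpha=2.0):
    """On-line learner for a disjunction of a few of the s input bits.

    Predict 1 when the summed weight of the active bits reaches theta.
    On a missed positive, multiply the active weights by alpha; on a
    false alarm, set the active weights to zero.
    """
    theta = float(s)
    weights = [1.0] * s
    mistakes = 0
    for x, label in examples:
        total = sum(w for w, xi in zip(weights, x) if xi)
        prediction = int(total >= theta)
        if prediction == label:
            continue
        mistakes += 1
        if label == 1:   # promotion step
            weights = [w * alpha if xi else w for w, xi in zip(weights, x)]
        else:            # elimination step
            weights = [0.0 if xi else w for w, xi in zip(weights, x)]
    return weights, mistakes

rng = random.Random(0)
s, relevant = 64, (3, 17, 40)       # hypothetical target: x3 OR x17 OR x40
examples = []
for _ in range(2000):
    x = [rng.randint(0, 1) for _ in range(s)]
    examples.append((x, int(any(x[i] for i in relevant))))
weights, mistakes = winnow(examples, s)
k = len(relevant)
bound = alpha_k_bound = 2 * k * (math.log2(s) + 1) + 1
print(mistakes, "mistakes; analysis allows at most", bound)
```

With these settings the usual analysis bounds the number of mistakes by 2k(log2 s + 1) + 1 on any example sequence, which is linear in k and logarithmic in s.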

Peled and Simeone [PS85] proved that it is NP-complete to decide if a function
given in disjunctive normal form is linearly separable. This problem is more difficult
than ours in that it has a very short input and must capture the whole function
in  a  set  of  weights.  Our  problem  has  a  much  longer  (extensional)  representation  of 
the  desired  function  (which  by  the  definitions  of  complexity  affords  an  algorithm 
more  time  to  run),  and  only  requires  the  net  to  remember  those  items  that  are 
explicitly  given.  So  with  less  to  do  and  more  time  to  do  it,  our  loading  problem  is 
computationally  easier;  therefore  their  result  is  not  tight  enough  for  our  purposes. 

All  these  learning  results  are  for  single  nodes  (possibly  preceded  or  followed  by 
a  layer  of  other  non-learning  nodes).  They  shed  little  light  on  our  question  about 
large,  arbitrarily-shaped  networks. 

3.5.2  Complex  Networks 

Some  attempts  have  been  made  to  analyze  the  behaviour  of  learning  algorithms 
in  the  context  of  composite  networks.  Rumelhart,  Hinton,  and  Williams  [RHW86] 
have  shown  that  when  the  generalized  delta  rule  is  used  in  an  arbitrary  feed-forward 
network  for  making  weight  updates,  the  net  has  a  gradient-descent  behaviour.  This 
is a pleasing result but there are at least two deficiencies: 1) No time bounds are
available  yet,  and  2)  Because  the  surface  in  weight-space  is  multimodal,  the  algo¬ 
rithm  may  descend  into  a  local  minimum  and  thereby  never  discover  fully  correct 
responses. 

Tesauro  and  Janssens  [TJ88]  report  empirical  results  studying  the  relationship 
between learning time and the predicate order, q, of a task. They measure a series
of (network, task) pairs parameterized only by q. The net has q inputs, 2q nodes
in the first layer (fully connected to each input) and a single output node (fully
connected to each node in the first layer). The task is a complete listing of the
$t = 2^q$ items for the parity function on q bits. When trained using back-propagation,
they observe learning times of approximately $4^q$. Since the task has size $2^q$,
this means the training time is $4^q/2^q = 2^q$ times the amount of data to be learned.
This result might also be re-interpreted as evidence that the learning time scales
exponentially in the size of the network.
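
The arithmetic in this paragraph can be reproduced directly. The snippet below is a sketch that enumerates the parity task and compares its size with the approximate learning time quoted above; the helper name `parity_task` is ours, and the figure $4^q$ is simply the value reported in the text, not something the code measures.

```python
from itertools import product

def parity_task(q):
    """Complete listing of (stimulus, response) items for parity on q bits."""
    return [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=q)]

for q in (2, 4, 6, 8):
    t = len(parity_task(q))          # number of items, 2**q
    reported_time = 4 ** q           # approximate learning time quoted in the text
    print(f"q={q}: t={t}, time/t={reported_time // t}")
```

The ratio printed in the last column is $2^q$, i.e. the time spent per item itself grows exponentially with q.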

In summary, there is good evidence that in general training neural networks
is extremely time-consuming for large applications. Overall these results give one
the  impression  that  some  very  simple  learning  problems  are  easy,  but  when  the 
problems  are  only  slightly  more  difficult  they  become  intractable.  However,  none 
of  the  results  are  really  conclusive  for  networks  of  arbitrary  shape.  Networks  of 
certain  designs  might  find  it  easy  to  learn  functions  that  are  difficult  to  learn  in  the 
particular  one-node  or  two-layer  designs  explored  in  the  literature;  or  perhaps  they 
will  find  easy  ones  hard. 

Even  beyond  the  explicit  studies  reviewed  here,  it  is  widely  acknowledged  that 
as  networks  get  larger  and  deeper,  their  learning  time  grows  prohibitively.  The 
scale-up  issue  is  therefore  an  important  research  problem  for  current  connectionist 
research. 
Chapter 4

THE INTRACTABILITY OF LOADING


Our  major  question  is  about  the  intrinsic  nature  of  the  learning  problem  we  have 
posed:  How  difficult  is  it  to  load  a  given  task  into  a  given  architecture?  As  discussed 
in  the  previous  chapter,  this  amounts  to  asking  how  much  time  is  required  for  a 
Turing  machine  to  recognize  the  following  language: 

\[ \mathrm{Perf}_{\mathcal{F}} = \{ (A,T) : \exists F \in \mathcal{F}^{n} \ni T \subseteq \mathcal{M}^{A}_{F} \} \]

(Terminology  used  here  and  the  related  complexity-theoretical  concepts  of  NP- 
completeness  are  explained  thoroughly  in  Garey  and  Johnson  [GJ79].) 

The  measure  of  how  difficult  a  decision  problem  is  must  be  relative  to  the  size  of 
a  particular  instance  of  the  problem.  The  size  of  an  instance  of  the  performability 
problem  is  taken  to  be  the  number  of  bits  that  it  takes  to  represent  the  instance, 
i.e. the architecture and the task. This number is roughly proportional to |A| + |T|.
As  the  architecture  gets  bigger  or  as  the  task  gets  bigger,  one  would  expect  any 
algorithm  to  take  longer  to  solve  it,  but  the  question  we  would  like  to  answer  is 
“How  much  longer?”  What  is  the  asymptotically  minimum  function  g(x)  for  the 
worst-case  amount  of  time  required  to  solve  an  instance  of  size  x? 

We prove below that $\mathrm{Perf}_{\mathrm{AOFns}}$ is NP-complete. This means that it belongs to
a class of computational problems for which no polynomial time algorithms have
ever been found. All NP-complete problems can be transformed into any other NP-complete
problem in polynomial time, so the development of a poly-time algorithm for one of them
would automatically give a poly-time solution to all of them. In fact it would
imply that a deterministic machine could solve all the same problems that could
be solved by a non-deterministic machine (i.e. a machine with a psychic ability
to guess solutions) with no more than a polynomial degradation in running time.
Technically, this development would be expressed as “P = NP”, but it is believed to
be exceedingly unlikely. Indeed, decades of experience have shown that the scale-up
function of every known algorithm for any NP-complete problem is an exponential
expression that becomes unmanageably large even for small instances of the problem
[GJ79, Chapter 1].

The fact that $\mathrm{Perf}_{\mathrm{AOFns}}$ is NP-complete is not a statement about the running
time  for  one  particular  learning  algorithm — it  is  a  result  about  the  intrinsic  difficulty 
of  the  problem.  Hence  it  is  not  practical  to  try  to  decide  large  instances  of  the 
performability  question.  (The  instance  of  a  loading  problem  is  large  when  the 
network  itself  is  large,  even  though  there  might  only  be  a  small  amount  of  data  to 
be  loaded.)  No  learning  rule  can  always  solve  this  problem  in  polynomial  time. 

Furthermore, because this decision problem is no harder than the search problem
from which it is distilled, the loading problem per se is also intractable. Assuming
P ≠ NP, no general-purpose algorithm can be developed for use in arbitrary architectures
that is guaranteed to load any given performable task in polynomial time.
(This  is  true  whether  the  algorithm  is  conceived  as  a  nodal  entity  working  in  a 
distributed  fashion  with  other  nodes,  or  as  a  global  entity  working  in  a  centralized 
fashion  on  the  network  as  a  whole.) 

The  parallelism  inherent  in  most  neural  network  systems  does  not  avoid  this 
intractability. An exponential expression ($c^n$) cannot be contained by dividing it
by a linear expression ($cn$). In many connectionist approaches to learning, there is
a strong reason why large numbers of computing elements will not accomplish the
loading problem in feasible time: By doubling the number of nodes available, you are
doubling the computational resources but you may also be doubling (or squaring!)
the  amount  of  computing  that  has  to  be  done.  Naive  attempts  to  exploit  parallelism 
can  actually  be  counterproductive. 
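
A few numbers make the point about dividing an exponential by a linear term. The sketch assumes the most favourable case, in which a sequential cost of $c^n$ is shared perfectly among $cn$ processors; the function name and the choice c = 2 are ours.

```python
def time_with_parallelism(n, c=2):
    """Sequential cost c**n shared evenly among c*n processors."""
    return c ** n / (c * n)

for n in (10, 20, 40, 80):
    print(n, time_with_parallelism(n))
```

Even with perfect speed-up the per-processor time still grows by a factor approaching c each time n increases by one.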

Hence  it  might  appear  that  we  cannot  hope  to  build  large  connectionist  networks 
that  will  reliably  learn  simple  supervised  learning  tasks. 

The  following  section  states  and  proves  the  fundamental  theorem  for  one  node 
function  set  and  Section  4.2  shows  how  the  result  also  applies  to  most  other  node 
function  sets. 

4.1  Proof  of  General  Case  using  AOFns 

To prove a problem, $P_1$, to be NP-complete, one must take another problem, $P_2$,
that is known to be NP-complete and transform it into $P_1$. That is, one must give
a polynomial time algorithm that can translate any arbitrary instance of $P_2$ into an
instance of $P_1$ that is true if and only if the instance of $P_2$ is true. This algorithm
is then called a ‘reduction from $P_2$ to $P_1$’. See [GJ79] for an explanation of this
technique. Of course if there is a poly-time solution to $P_1$ there will automatically
be a poly-time solution to $P_2$ by first applying the reduction algorithm and then
the solution algorithm, but such a composed procedure is presumed not to exist, so
the solution algorithm is presumed not to exist either.

The particular NP-complete problem that we will use in the role of $P_2$ is called
3SAT. An instance of 3SAT is an expression in Boolean variables given in conjunctive
normal form (i.e. a conjunction of disjunctions) in which all the disjunctions have
exactly 3 literals. A literal is a logical variable or its negation. The instance is said to
be true (or satisfiable) if the variables can all be given values such that the whole logical
expression is true. For example the expression $(x_1, x_3, x_4)(x_2, x_3, x_1)(x_1, x_2, x_3)$
is satisfied by the assignments $x_1 = 0$, $x_2 = 1$, $x_3 = 1$, $x_4 = 0$.
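
A small sketch may help fix the definitions. The encoding of literals as signed integers (+i for variable i, -i for its negation) and the clause set used below are our own illustrative choices; the clause set is a made-up instance, not the expression printed above.

```python
from itertools import product

def clause_true(clause, assignment):
    """A clause holds if at least one of its literals evaluates to 1."""
    return any(assignment[abs(lit)] == (1 if lit > 0 else 0) for lit in clause)

def satisfies(clauses, assignment):
    """The whole expression holds if every clause holds."""
    return all(clause_true(c, assignment) for c in clauses)

def find_assignment(clauses, num_vars):
    """Exhaustive search over all 2**num_vars assignments (exponential time)."""
    for bits in product((0, 1), repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if satisfies(clauses, assignment):
            return assignment
    return None

# Hypothetical instance with four variables and three clauses.
clauses = [(-1, 3, 4), (2, -3, 1), (1, 2, -3)]
print(satisfies(clauses, {1: 0, 2: 1, 3: 1, 4: 0}))

# All eight sign patterns over three variables cannot be satisfied together.
unsat = [tuple(s * v for s, v in zip(signs, (1, 2, 3)))
         for signs in product((1, -1), repeat=3)]
print(find_assignment(unsat, 3))
```

Checking a proposed assignment is fast, whereas the search shown here visits up to $2^n$ assignments; that asymmetry is what NP-completeness is about.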
Theorem 1 $\mathrm{Perf}_{\mathrm{AOFns}}$ is NP-complete.

Proof: by reduction from 3SAT. Let the 3SAT problem be (Z,C) where Z is a
set of variables $\{\zeta_1, \zeta_2, \zeta_3, \ldots\}$ and C is a set of disjunctive clauses over them. Each
clause has 3 literals. For (Z,C) to be satisfiable, there must be an assignment
$\Pi : Z \to \{0,1\}$ such that at least one literal in each clause has value 1.

A formal construction is given here for the architecture and task, followed by
an exposé. Let w = |Z| be the number of variables and m = |C| the number of
clauses. The 3SAT instance (Z,C) is reduced to (A,T), an instance of the loading
problem, where

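\[
\begin{aligned}
A &= (P, V, S, R, E) \\
S &= \{a, b\} \\
R &= V = \{ w_i, x_i, y_i, z_i : \zeta_i \in Z \} \cup \{ c_j : C_j \in C \} \\
P &= S \cup V \\
E &= \{ (a, w_i), (a, z_i), (b, w_i), (b, z_i), \\
  &\qquad (w_i, x_i), (w_i, y_i), (z_i, x_i), (z_i, y_i) : \zeta_i \in Z \} \\
  &\quad \cup \{ (w_i, c_j) : \zeta_i \in C_j \} \cup \{ (z_i, c_j) : \bar{\zeta}_i \in C_j \} \\
T &= \{ I_1, I_2, I_3 \} \\
I_1 &= (0\;0,\ (0\;0\;0\;0)^{w}\ 0^{m}) \\
I_2 &= (1\;1,\ (1\;1\;1\;1)^{w}\ *^{m}) \\
I_3 &= (0\;1,\ (*\;0\;1\;*)^{w}\ 1^{m})
\end{aligned}
\]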

This arcane piece of notation is explained in a 2-stage reader-friendly example.
Stage 1: For every variable $\zeta_i \in Z$ construct the partial architecture and partial
task shown in Figure 4.1. From item 1 we know that $f_w(0,0) = f_z(0,0) = 0$; hence
$f_x(0,0) = 0$ and $f_y(0,0) = 0$. From item 2 we know that $f_w(1,1) = f_z(1,1) = 1$;
hence $f_x(1,1) = 1$ and $f_y(1,1) = 1$. By comparing item 2 and item 3 we know

\[
\begin{aligned}
& f_x(f_w(1,1), f_z(1,1)) = 1 \neq 0 = f_x(f_w(0,1), f_z(0,1)) \\
& f_w(1,1) \neq f_w(0,1) \ \text{or}\ f_z(1,1) \neq f_z(0,1) \\
& 1 \neq f_w(0,1) \ \text{or}\ 1 \neq f_z(0,1).
\end{aligned}
\tag{4.1}
\]

By comparing item 1 and item 3 we know

\[
\begin{aligned}
& f_y(f_w(0,0), f_z(0,0)) = 0 \neq 1 = f_y(f_w(0,1), f_z(0,1)) \\
& f_w(0,0) \neq f_w(0,1) \ \text{or}\ f_z(0,0) \neq f_z(0,1) \\
& 0 \neq f_w(0,1) \ \text{or}\ 0 \neq f_z(0,1).
\end{aligned}
\tag{4.2}
\]

And from (4.1) and (4.2) we conclude $f_w(0,1) \neq f_z(0,1)$. We will associate some
SAT variable $\zeta_i$ with the group of nodes in this construction. For mnemonic value
and brevity, let $\langle\zeta_i\rangle$ stand for “the value computed by the w-node in the block
of nodes associated with $\zeta_i$ when given the input 0 1”. And let $\langle\bar{\zeta}_i\rangle$ stand for its
negation, i.e. the output from the z-node for input 0 1.
Figure 4.2: The composed construction for Theorem 1. This example is for
the single clause $(\zeta_1, \zeta_2, \zeta_3)$.


Stage  2:  For  each  clause  in  the  SAT  system  construct  a  single  node  in  the  second 
layer  of  the  architecture  with  inputs  from  all  nodes  associated  with  its  participating 
literals.  Putting  variables’  nodes  and  the  clause  node  together,  we  get  what  is  shown 
in  Figure  4.2.  It  shows  the  construction  for  an  example  SAT  system  consisting  of 
only one clause $(\zeta_1, \zeta_2, \zeta_3)$. Observe that each item consists of the stimulus from
an  item  from  Figure  4.1,  three  replications  of  its  response  (one  per  variable),  and 
another  response  bit  for  the  clause  node  (node  c). 

Claim:  The  constructed  architecture  can  perform  the  task  iff  the  SAT  instance 
is  satisfiable. 

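Proof: Remember that $f_w(0,0) = f_z(0,0) = 0$ in each variable construct. By
inspecting item 1 and item 3,

\[ f_c(0,0,0) = 0 \neq 1 = f_c(\langle\zeta_1\rangle, \langle\zeta_2\rangle, \langle\zeta_3\rangle). \]

Hence

\[ \langle\zeta_1\rangle \neq 0 \ \text{or}\ \langle\zeta_2\rangle \neq 0 \ \text{or}\ \langle\zeta_3\rangle \neq 0, \]

which is exactly the semantics of a disjunctive clause. If $\Pi$ exists then let
$\langle\zeta_i\rangle = \Pi(\zeta_i)$, that is

\[
f_{w_i} = \begin{cases} \mathrm{OR} & \text{if } \Pi(\zeta_i) = 1 \\ \mathrm{AND} & \text{if } \Pi(\zeta_i) = 0 \end{cases}
\qquad\text{and}\qquad
f_{z_i} = \begin{cases} \mathrm{AND} & \text{if } \Pi(\zeta_i) = 1 \\ \mathrm{OR} & \text{if } \Pi(\zeta_i) = 0 \end{cases}
\]

For all variables let $f_x = \mathrm{AND}$ and $f_y = \mathrm{OR}$, and for the clause node let $f_c = \mathrm{OR}$.
The reader is welcome to check that this configuration performs the task.

Conversely, if a configuration exists let $\Pi(\zeta_i) = \langle\zeta_i\rangle$, and observe $\zeta_1 = 1$ or $\zeta_2 = 1$
or $\zeta_3 = 1$ as required. This proves the claim. □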


The  extension  to  multi-clause  systems  should  be  clear. 
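
For readers who prefer to see the construction executed, the following sketch builds the architecture and the three-item task for a small 3SAT instance and checks, by exhaustive search over AND/OR configurations, that performability coincides with satisfiability. It is an illustration under our own encoding assumptions (signed integers for literals, dictionaries for responses with absent keys as don't-cares), not an efficient loading algorithm; all function names are ours.

```python
from itertools import product

AOFNS = {"AND": lambda xs: int(all(xs)), "OR": lambda xs: int(any(xs))}

def satisfiable(num_vars, clauses):
    """Brute-force 3SAT; literal +i is variable i, -i is its negation."""
    return any(
        all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses)
        for bits in product((0, 1), repeat=num_vars)
    )

def build_instance(num_vars, clauses):
    """Architecture and task of the reduction, split by layer."""
    first_layer = [(k, i) for i in range(1, num_vars + 1) for k in "wz"]
    second_layer = {}
    for i in range(1, num_vars + 1):
        second_layer[("x", i)] = [("w", i), ("z", i)]
        second_layer[("y", i)] = [("w", i), ("z", i)]
    for j, clause in enumerate(clauses):
        # a positive literal reads the w-node, a negated one reads the z-node
        second_layer[("c", j)] = [("w", l) if l > 0 else ("z", -l) for l in clause]
    item1, item2, item3 = {}, {}, {}
    for i in range(1, num_vars + 1):
        for k in "wxyz":
            item1[(k, i)] = 0
            item2[(k, i)] = 1
        item3[("x", i)] = 0          # w and z are don't-cares in item 3
        item3[("y", i)] = 1
    for j in range(len(clauses)):
        item1[("c", j)] = 0          # clause nodes are don't-cares in item 2
        item3[("c", j)] = 1
    items = [((0, 0), item1), ((1, 1), item2), ((0, 1), item3)]
    return first_layer, second_layer, items

def performable(first_layer, second_layer, items):
    """Is there an AND/OR choice for every node that performs the task?"""
    for choice in product(AOFNS, repeat=len(first_layer)):
        config = dict(zip(first_layer, choice))
        outs = [{v: AOFNS[config[v]](stim) for v in first_layer} for stim, _ in items]
        if not all(resp.get(v, out[v]) == out[v]
                   for (_, resp), out in zip(items, outs) for v in first_layer):
            continue
        # second-layer nodes feed nothing else, so each is fitted independently
        if all(any(all(resp.get(v) in (None, f([out[u] for u in ins]))
                       for (_, resp), out in zip(items, outs))
                   for f in AOFNS.values())
               for v, ins in second_layer.items()):
            return True
    return False

instances = {
    "one clause": (3, [(1, 2, 3)]),
    "two clauses": (3, [(1, -2, 3), (-1, 2, -3)]),
    "all eight sign patterns": (3, [tuple(s * v for s, v in zip(signs, (1, 2, 3)))
                                    for signs in product((1, -1), repeat=3)]),
}
for name, (n, clauses) in instances.items():
    print(name, satisfiable(n, clauses), performable(*build_instance(n, clauses)))
```

The search over first-layer configurations is itself exponential in the number of variables, which is consistent with the theorem: the reduction is cheap, deciding the resulting instance is not.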

Thus we have $\mathrm{SAT} \propto \mathrm{Perf}_{\mathrm{AOFns}}$ and it is easy to see that the algorithm for the
transformation runs in polynomial time (in fact linear time and log space).

Finally, it must be demonstrated that there is a non-deterministic machine that
can decide $\mathrm{Perf}_{\mathrm{AOFns}}$ in time polynomial in the length of (A,T). Writing down
a complete configuration of AOFns takes one bit for each node in A. That the
configuration is correct can be checked by evaluating each node function once for
each item in T. This takes time $O(|V| \times |T|)$ under the assumption that it takes
constant time to evaluate any single $f_i$.

This, and $\mathrm{SAT} \propto \mathrm{Perf}_{\mathrm{AOFns}}$, implies $\mathrm{Perf}_{\mathrm{AOFns}}$ is NP-complete. □
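
The membership half of the proof amounts to a simple checking procedure, sketched below under our own representational assumptions: nodes are listed in an order in which every node follows its inputs, a configuration names AND or OR for each node, and a response omits the nodes it does not care about. The names `performs`, `order` and `preds` are hypothetical.

```python
AOFNS = {"AND": lambda xs: int(all(xs)), "OR": lambda xs: int(any(xs))}

def performs(order, preds, config, task):
    """Check a guessed configuration against every item of the task.

    Returns (verdict, number of node evaluations); the count never
    exceeds len(order) * len(task).
    """
    evaluations = 0
    for stimulus, response in task:
        value = dict(stimulus)
        for v in order:
            value[v] = AOFNS[config[v]]([value[u] for u in preds[v]])
            evaluations += 1
            if v in response and response[v] != value[v]:
                return False, evaluations
    return True, evaluations

# The single-variable gadget of Stage 1: inputs a, b and nodes w, z, x, y.
order = ["w", "z", "x", "y"]
preds = {"w": ["a", "b"], "z": ["a", "b"], "x": ["w", "z"], "y": ["w", "z"]}
task = [
    ({"a": 0, "b": 0}, {"w": 0, "x": 0, "y": 0, "z": 0}),
    ({"a": 1, "b": 1}, {"w": 1, "x": 1, "y": 1, "z": 1}),
    ({"a": 0, "b": 1}, {"x": 0, "y": 1}),
]
good = {"w": "OR", "z": "AND", "x": "AND", "y": "OR"}
bad = {"w": "OR", "z": "OR", "x": "AND", "y": "OR"}
print(performs(order, preds, good, task))
print(performs(order, preds, bad, task))
```

The first configuration passes with twelve evaluations (four nodes times three items); the second fails on item 3 because both first-layer nodes output 1 there.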

This proof is intended to be applicable to $\mathrm{Perf}_{\mathcal{F}}$ for $\mathcal{F}$ being more than just
AOFns. Hence it begins by forcing $f_w(0,0) = 0$ and $f_w(1,1) = 1$ (amongst other
things). Such could have been assumed from the outset since OR(0,0) = 0 and
AND(1,1) = 1 but we chose not to exploit these peculiarities of AOFns in the
proof. Regardless of the value for $f_w(0,1)$, one of {AND, OR} will satisfy all the
requirements, so the proof is strong enough to apply to AOFns while not being
specific to it.

This  proof  uses  the  “don’t-care”  symbol,  but  such  is  not  always  a  part  of  the 
learning  protocol  used  in  connectionist  studies.  In  appendix  C  there  is  another 
version  of  the  proof  that  avoids  the  “don’t-care”  by  using  some  extra  signals  and 
nodes.  Hence  this  detail  does  not  strongly  alter  the  nature  of  the  problem. 

4.2  Other  Node  Function  Sets 

The  intent  of  this  section  is  to  demonstrate  that  the  intractability  of  the  performa- 
bility  problem  does  not  depend  much  on  the  particular  node  function  set  being 
used — its  difficulty  remains  for  essentially  all  non-trivial  cases. 

Theorem 1 deals only with AOFns, but connectionist studies typically use LSFns,
the  linearly  separable  functions.  LSFns  includes  all  of  AOFns  and,  when  the  num¬ 
ber  of  inputs  to  a  node  is  large,  it  is  considerably  more  powerful.  It  might  seem, 
therefore,  that  this  extra  power  would  make  loading  easier.  Unfortunately,  this  case 
(and  even  LUFns)  is  just  as  hard. 

Corollary 2 For any node function set $\mathcal{F}$ such that all members of $\mathcal{F}$ are binary-valued
functions, and $\mathcal{F} \supseteq \{\mathrm{AND}, \mathrm{OR}\}$, $\mathrm{Perf}_{\mathcal{F}}$ is NP-hard.

Proof: Both directions of the proof of the claim in Theorem 1 require nodes able, at
least, to perform functions from AOFns. The reduction thus follows for any node
function set that includes them. □

Corollary 3 $\mathrm{Perf}_{\mathrm{LSFns}}$ is NP-complete.


Proof: NP-hardness follows from Corollary 2, so we need only to show that $\mathrm{Perf}_{\mathrm{LSFns}}$
is in NP. For this to be true, there must exist some poly-time way of guessing a function
from LSFns and being sure that indeed it is from LSFns. If fan-in were bounded
in our model, then this would be easy since we could get the non-deterministic selection
to be from a fixed table of all LSFns up to that input size. Without bounds
on fan-in, this technique will not work. One might attempt to achieve a selection
from LSFns by simply writing down the weights that are used in the linear sum,
but since the weights are assumed to be real (i.e. of a potentially infinite number
of decimal places), this technique is also inadequate. However, Hong [Hon87] has
recently proved that approximations to the weights are sufficient to encode any and
all members of LSFns. Specifically, only a polynomial number of digits are required
(polynomial in the fan-in), and hence $\mathrm{Perf}_{\mathrm{LSFns}}$ is NP-complete. □

Muroga [Mur71, thm 9.3.2.1] implicitly proves the same result about polynomial
bounds  on  the  weights  in  LSFns.  It  is  tighter  but  less  direct. 

Corollary 4 $\mathrm{Perf}_{\mathrm{LUFns}}$ is NP-complete.

Proof: Again, NP-hardness follows from Corollary 2, but to prove $\mathrm{Perf}_{\mathrm{LUFns}}$ to be
in  NP,  we  must  give  some  format  for  guessing  members  of  LUFns.  It  must  have 
some  poly-time  way  of  writing  down  an  arbitrary  function  and  checking  that  it  is 
in  LUFns. 

To fully specify an arbitrary member of LUFns requires $2^{|\mathrm{pre}(v)|}$ bits and hence
it takes exponential time to write it down. (The statement of the theorem implies
no bound on the fan-in to a node.) However, each node function will be invoked
exactly t = |T| times in the performance of the task; hence we can specify a function
$F \in \mathrm{LUFns}$ by asserting a default value (1, say) to cover most inputs, and then
listing the exceptional inputs, a, for which F(a) = 0 (of which there are at most t).
Since T has a unary encoding of t, there is a representation of F that is polynomial
in the length of (A,T), and this means that a function can always be written down
in poly time.

Making sure that such a function is a member of LUFns is trivial since all
binary-valued functions are members. Hence $\mathrm{Perf}_{\mathcal{F}} \in NP$ even when $\mathcal{F} = \mathrm{LUFns}$,
and $\mathrm{Perf}_{\mathrm{LUFns}}$ is NP-complete. □
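
The default-plus-exceptions encoding used in this proof is easy to picture as a data structure. The class below is a hypothetical sketch, not part of the formal model: the name `SparseBooleanFunction` and the crude bit count are our own.

```python
class SparseBooleanFunction:
    """A Boolean function stored as a default output plus a list of exceptions."""

    def __init__(self, fan_in, exceptions=(), default=1):
        self.fan_in = fan_in
        self.default = default
        self.exceptions = frozenset(tuple(e) for e in exceptions)

    def __call__(self, inputs):
        # inputs listed as exceptions get the opposite of the default value
        if tuple(inputs) in self.exceptions:
            return 1 - self.default
        return self.default

    def size_in_bits(self):
        # one bit for the default, fan_in bits per listed exception
        return 1 + self.fan_in * len(self.exceptions)

fan_in = 50
zero_a = tuple([0] * fan_in)
zero_b = tuple([1] + [0] * (fan_in - 1))
f = SparseBooleanFunction(fan_in, exceptions=[zero_a, zero_b])
print(f(zero_a), f(tuple([1] * fan_in)))
print(f.size_in_bits(), "bits instead of", 2 ** fan_in)
```

Because a node is consulted only once per item, at most t exceptions are ever needed, so the description stays polynomial in the length of (A,T) even though the full truth table is exponential in the fan-in.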

LSFns is a special case of the quasi-linear functions (QLFns). Corollary 3 pertains
only to discrete, binary-valued signals and does not apply to real-valued quasi-linear
functions. However, another theorem pertains specifically to the popular
logistic-linear functions (LLFns) used in back-propagation:

Theorem 5 $\mathrm{Perf}_{\mathrm{LLFns}}$ is NP-complete. □


Proof in Appendix B. As a corollary, performability with the more general class of
quasi-linear functions, $\mathrm{Perf}_{\mathrm{QLFns}}$, is also NP-hard.

These  theorems  indicate  that  the  difficulty  in  the  loading  problem  has  very  little 
to do with the choice of node function sets. This observation is strengthened
by Theorem 12 below, which states that some tasks which are performable using very
restricted  node  function  sets,  are  difficult  to  load  even  when  that  node  function  set 
is  greatly  expanded.  This  argues  that  the  difficulties  of  loading  will  not  be  overcome 
by  searching  for  ever  more  powerful  node  types. 

We  end  this  chapter  with  a  more  convenient  statement  of  our  main  result: 

Corollary  6  Loading  is  NP-complete. 

Proof: The decision problem is NP-complete, and since being able to solve the
search problem would allow one to answer the decision problem, the search problem
must be at least as hard. □


Note  that  no  node  function  set  is  explicitly  mentioned  in  this  corollary.  There 
are  two  ways  in  which  this  makes  the  corollary  technically  loose.  First  is  that  for 
an  absurdly  simple  node  function  set  (e.g.  where  the  set  has  only  one  member), 
the problem is not NP-hard. Second is that deciding membership in an absurdly
complicated node function set (e.g. where the truth-table representation of each
function must name a halting program), might not be NP-easy. However, in the
common cases, and in other reasonable cases we explored, the result is robustly true.
Because  it  holds  for  any  node  function  set  of  interest,  we  will  hereafter  omit  the 
specification  of  the  node  function  set  in  order  to  imply  generality. 

We note that a different proof of the NP-completeness of $\mathrm{Perf}_{\mathrm{LSFns}}$ has recently
been  found  by  Blum  and  Rivest  [BR88].  Their  proof  differs  from  ours  in  that 
different  parameters  are  scaled  up.  Our  proof  scales  up  the  size  of  the  architecture 
and  the  number  of  bits  in  the  response  strings  while  keeping  the  number  of  items 
and  the  number  of  bits  in  the  stimulus  strings  constant.  Their  proof  keeps  the  size 
of  the  architecture  and  the  number  of  bits  in  the  response  strings  constant  while 
scaling  up  the  number  of  items  and  the  number  of  bits  in  each  stimulus  string. 

It  might  also  be  noted  that  there  are  node  function  sets  for  which  performability 
can be proved NP-complete without scaling up the size of the task or the length of
the strings at all; using such a node function set, machines of arbitrary size would
be  unable  to  load  even  a  fixed  amount  of  data! 
Chapter 5
SUBCASES 


Our  results  preclude  only  the  broadest,  most  ambitious  interpretation  of  the  goal 
of  connectionist  learning.  Essentially,  the  goal  we  have  formulated  is  to  find  an 
algorithm  that  is  guaranteed  to  load  any  performable  task  in  any  conceivable  net. 
One  can  imagine  several  ways  to  constrain  the  problem  in  such  a  way  that  the  new 
loading problem would have some special regularity that might facilitate its solution.
Such  constraints  would  involve 

•  restrictions  on  architectural  design, 

•  restrictions on tasks, and/or

•  different  criteria  of  success. 

For  most  such  sub-cases,  our  theorem  says  nothing. 

This section discusses several ways to define sub-problems and/or different problems
that may be easier to solve than the general loading problem formulated above.
Interspersed  amongst  these  comments  are  several  corollaries  to  the  above  proof  that 
state  further  negative  results. 
5.1 Architectural Constraints


First, Theorem 1 is a statement about networks and tasks in general, but there
may  be  large  useful  classes  of  networks  (defined  by  some  design  restrictions)  where 
loading  a  task  would  always  be  achievable  in  polynomial  time.  It  has  been  an 
empirical  observation  that  although  some  algorithms  (notably  back-propagation) 
work  well  in  nets  that  have  only  a  few  levels  intervening  between  input  posts  and 
output  posts,  they  work  much  slower  in  deep  nets.  One  might  be  tempted  to 
infer  that  shallow  nets  would  be  intrinsically  easier  to  load.  By  examining  the 
construction  in  the  above  proof  we  see  this  is  not  so.  The  construction  uses  only  2 
layers  and  yet  an  algorithm  for  loading  it  was  shown  to  be  equivalent  to  an  algorithm 
for  solving  3SAT.  Hence: 

Corollary 7 Loading is NP-complete even when the architectures are restricted to
be of depth ≤ 2 and of fan-in ≤ 3. □


Rather  than  limit  the  maximum  depth  or  fan-in,  what  is  more  likely  to  help  is 
a  restriction  that  sets  a  minimum  depth  (say  as  a  function  of  the  width  of  the 
net),  or  a  minimum  fan-in,  because  this  forces  a  minimum  number  of  degrees  of 
freedom  everywhere.  Since  experimental  evidence  seems  to  contradict  both  these 
suggestions,  it  would  be  important  to  resolve  the  issue. 

Other architectural design constraints have been explored. As a first piece of
analysis, we have examined some issues in shallow networks that have gross structure
extending through their width. The results are interesting and substantial
enough  to  warrant  a  separate  chapter.  See  Chapter  6. 

One  avenue  of  freedom  usually  not  exploited  by  connectionist  learning  schemes 
is  to  alter  the  architecture  as  learning  proceeds.  When  carried  to  extremes,  this 
would  amount  to  an  exercise  in  arbitrary  circuit  design,  rather  than  in  connectionist 
learning, but adhering rigidly to the starting architecture may be just too constrictive;
somewhere between these two extremes there may be a balance that combines
the  best  of  both  worlds.  Valiant  and  others  [Val84,KLPV87]  have  initiated  the 
study  of  what  can  be  feasibly  learned  using  total  freedom  of  connectivity  within 
a  certain  class  of  architectures.  For  example,  their  /i-exp cessions  are  the  same  as 
tree-shaped  architectures  that  use  AOFns. 

It is conceivable that the difficulties in loading stem specifically from the non-recurrence
of the nets and the fact that all their ‘knowledge’ about a stimulus must
be elicited in one single evaluation of each node function. If so, then a more reasonable
model of network memory might involve storing data as cycles in state-space
where the power of attractor dynamics could be exploited to make loading easier
(albeit at the cost of more expensive retrieval). This would be a large departure
from our model, but there are plenty of pitfalls there too; Porat [Por87] proves that
in such a system the problem of deciding just whether a configured network stabilizes or
cycles is NP-hard. See also [God87, Lip87].

5.2  Task  Constraints 

Next, our formulation of the learning problem may be inappropriate in that it requires
a network to be able to load too large a class of tasks. By using performability
as the decision problem, we are in effect defining the task class in terms of the architecture
itself and asking that any architecture $A$ be able to load any task in the set
$P_A = \{T : \exists F \text{ such that } M_F \supseteq T\}$. But it is not necessary to expect an architecture to be
able to load all of these tasks. From a practical point of view, all that is necessary
is that it be able to perform and load some useful class, $\mathcal{T}$, of tasks. Obviously, it
is necessary that $\mathcal{T} \subseteq P_A$, and the results herein show that it is too ambitious to
have $\mathcal{T} = P_A$ for arbitrary $A$. However, there are many ways to define $\mathcal{T}$ so as to
exclude some tasks in $P_A$, thus possibly leading to a loadable class. It would be
useful to be able to characterize just what class of tasks a network could learn, or
conversely, to be able to determine what types of architectures could learn a given
class of tasks.


Our main theorem has implications for the restricted classes of monotonic tasks,
small tasks, and tasks that are performable using very small and simple node function
sets.

Define $\sigma \succeq \delta$ to mean that every element of the binary vector $\sigma$ is a 1 if its
corresponding element in binary vector $\delta$ is a 1. A monotonic function is a function
$g$ such that

$\sigma \succeq \delta \Rightarrow g(\sigma) \succeq g(\delta)$.


A monotonic task, $T$, is a set of items such that for some monotonic function $g$, $T$
agrees with $g$:

$(\sigma, \rho) \in T \Rightarrow \rho$ agrees with $g(\sigma)$.
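
The definition of monotonicity can be checked by brute force on small functions. The following Python sketch is only an illustration and is not part of the original development: the names `dominates` and `is_monotonic`, and the choice to represent binary vectors as tuples of 0/1, are assumptions made here.

```python
from itertools import product

def dominates(sigma, delta):
    """True if sigma has a 1 wherever delta has a 1."""
    return all(s >= d for s, d in zip(sigma, delta))

def is_monotonic(g, n_inputs):
    """Check that sigma dominating delta implies g(sigma) dominating g(delta)."""
    vectors = list(product((0, 1), repeat=n_inputs))
    return all(
        dominates(g(sigma), g(delta))
        for sigma in vectors
        for delta in vectors
        if dominates(sigma, delta)
    )

# Illustrative functions from 2-bit vectors to output vectors.
and_or = lambda v: (v[0] & v[1], v[0] | v[1])  # both outputs are monotonic
xor = lambda v: (v[0] ^ v[1],)                 # flips back to 0 on input (1, 1)

print(is_monotonic(and_or, 2), is_monotonic(xor, 2))
```

The exhaustive loop visits all pairs of the $2^n$ input vectors, so it is only meant for tiny examples.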


Corollary 8 Loading is NP-complete even when tasks are restricted to be monotonic.


Corollary 9 Loading is NP-complete even when there are only two bits
in the stimulus strings ($s = |S| = 2$). □

Corollary 10 Loading is NP-complete even when tasks are restricted to be of no
more than 3 items. □


A more promising avenue is to define the task restrictions in terms of what is
performable by a network that is in some way less powerful than the network being
loaded. One technique for doing this uses the notions of teacher and learner. A
teacher is a network that is used to define the set of tasks that a learner network
will be required to load. The word ‘teacher’ is meant to connote what has to be learned;
it is not to be confused with some mechanism for facilitating the loading process.
For example, suppose we have a network, $A$, that can perform a task, $T$, using only
those node functions in the set $\mathcal{G}$. And suppose that another network of the same
architecture but capable of using a (larger) node function set $\mathcal{F}$ is charged with
loading $T$. Call the first network the teacher and the second the learner. If $\mathcal{G} \subseteq \mathcal{F}$
then the tasks performable by the teacher will be a subset of the tasks performable
by the learner. Is it easier to decide performability of this smaller set of tasks?

To denote this new question, the parameters for describing the teacher are written
to the left of Perf and those for the learner are written to the right; the current
example is denoted by $_{\mathcal{G}}\mathrm{Perf}_{\mathcal{F}}$. Formally, it requires, for all architectures, $A$, and for
all tasks, $T$, the ability to compute an output, $d$, such that

$d = 1 \Rightarrow \exists F \in \mathcal{F}^n : T \subseteq M_F$
$d = 0 \Rightarrow \nexists F \in \mathcal{G}^n : T \subseteq M_F$

Note that in some cases either answer would be correct, and hence we call this a
relaxed decision problem.
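
To see how both answers can be acceptable, consider a toy case with a single two-input node. The Python sketch below is an illustration under stated assumptions, not the thesis's construction: it takes the teacher's set to be {AND, OR} (standing in for AOFns) and the learner's set to be all 16 Boolean functions of two bits (standing in for a lookup-table set); the names `performable` and `correct_answers` are hypothetical.

```python
from itertools import product

AND = lambda a, b: a & b
OR = lambda a, b: a | b

# Assumed stand-in for the teacher's node function set.
TEACHER_FNS = [AND, OR]
# Assumed stand-in for the learner's set: every truth table on two bits.
LEARNER_FNS = [
    (lambda a, b, t=table: t[2 * a + b])
    for table in product((0, 1), repeat=4)
]

def performable(task, fns):
    """True if some function in fns reproduces every (stimulus, response) item."""
    return any(all(f(*s) == r for s, r in task) for f in fns)

def correct_answers(task):
    """Set of outputs d that satisfy the relaxed decision problem for this task."""
    answers = set()
    if performable(task, LEARNER_FNS):      # d = 1 requires a learner configuration
        answers.add(1)
    if not performable(task, TEACHER_FNS):  # d = 0 requires that the teacher fails
        answers.add(0)
    return answers

AND_TASK = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_TASK = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
BAD_TASK = [((0, 0), 0), ((0, 0), 1)]  # contradictory items, performable by nobody

for name, task in [("AND", AND_TASK), ("XOR", XOR_TASK), ("BAD", BAD_TASK)]:
    print(name, sorted(correct_answers(task)))
```

In this sketch the XOR task is performable by the learner but not by the teacher, so either output satisfies the two implications; that is the slack that makes the problem relaxed.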

The teacher/learner device parallels the technique used by Pitt and Valiant in
[PV86, definition 1.2].

The question $_{\mathcal{F}}\mathrm{Perf}_{\mathcal{F}}$ is exactly the original type of question $\mathrm{Perf}_{\mathcal{F}}$.

The following corollary shows that no advantage can be gained from extra node
function power when loading tasks:

Corollary 11 $_{\mathcal{G}}\mathrm{Perf}_{\mathcal{F}}$ is NP-complete for $\mathcal{F}$ and $\mathcal{G}$ being any reasonable supersets
of AOFns.

Proof: This follows by the same argument used for Corollary 2. Both directions of
the proof of the claim on page 43 in Theorem 1 only require nodes able, at least,
to perform functions from AOFns. As long as $\mathcal{F}$ includes AOFns, one direction of
the proof holds, and as long as $\mathcal{G}$ includes AOFns, the other direction of the proof
holds. □

Just to emphasize how a large difference in node functionality makes no difference
in loading complexity, witness an extreme case of the above corollary:

Corollary  12  Loading  an  architecture  A  using  LUFns  is  NP-complete  even  when 
the  tasks  are  restricted  to  be  performable  by  A  using  AOFns.  □ 


This  corollary  deals  with  a  type  of  task  restriction,  but  it  also  provides  further 
evidence that the NP-completeness of the loading problem does not derive from
difficulties  inherent  in  the  node  function  set.  Devising  ever  more  powerful  node 
functionality  will  not  overcome  the  intractability  here. 

5.3  Relaxed  Criteria 

Finally, our mathematical question has a very exacting criterion of success in training:
either the machine performs perfectly or it doesn’t. If the criterion were more
lenient then the problem might be much easier. Some probabilistic or approximate
criterion of learning might be more appropriate. Here is one that won’t help:

Corollary 13 Loading is NP-complete even when only 67% of the items are required
to be retrieved correctly.

Proof: Loading slightly more than $\frac{2}{3}$ of 3 items is the same requirement as loading
all 3 items. □
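
The arithmetic behind this proof can be spelled out in a few lines. The sketch below is illustrative only; the helper name `items_required` is an assumption, and exact fractions are used to avoid floating-point rounding.

```python
import math
from fractions import Fraction

def items_required(n_items, fraction):
    """Smallest whole number of items that is at least `fraction` of n_items."""
    return math.ceil(fraction * n_items)

# 67% of 3 items is 2.01 items, which rounds up to all 3.
print(items_required(3, Fraction(67, 100)))
# Exactly two thirds would only require 2 of the 3 items.
print(items_required(3, Fraction(2, 3)))
```

The corollary therefore relies on the threshold being strictly above two thirds together with the 3-item bound of Corollary 10.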

5.4  Summary

Note  that  all  the  restrictions  mentioned  in  this  section  actually  hold  simultaneously. 


Corollary  14  Loading  is  NP-complete  even  when 

•  the architectures are restricted to be of depth $\le 2$ and of fan-in $\le 3$,

•  tasks are restricted to be monotonic,

•  there are only two bits in the stimulus strings ($s = |S| = 2$),

•  tasks are restricted to be of no more than 3 items,

•  only 67% of the items are required to be retrieved correctly, and

•  tasks are restricted to be performable by AOFns, although a configuration may
draw node functions from LUFns.




Chapter  6 

SHALLOW  ARCHITECTURES 


The loading problem is NP-complete even for networks of depth 2, so rather than
attempting  to  deal  with  deep  nets,  we  shall  limit  our  attention  to  shallow  nets  and 
try  to  identify  additional  constraints  that  yield  tractable  loading  problems.  For 
further  justification  of  this  strategy,  we  quote  Baldi  and  Venkatesh  [BV87]: 

It  is  not  unusual  to  hear  discussions  about  the  tradeoffs  between  the 
depth and width of a circuit. We believe that one of the main contributions
of complexity analysis is to show that this tradeoff is in some
sense  minimal  and  that  in  fact  there  exists  a  very  strong  bias  in  favour 
of  shallow  (i.e.  constant  depth)  circuits.  There  are  multiple  reasons 
for  this.  In  general,  for  a  fixed  size,  the  number  of  different  functions 
computable  by  a  circuit  of  small  depth  exceeds  the  number  of  those 
computable  by  a  deeper  circuit.  That  is,  if  one  had  no  prior  knowledge 
regarding  the  function  to  be  computed  and  was  given  m  hidden  units 
then  the  optimal  strategy  would  be  to  choose  a  circuit  of  depth  two  with 
the  m  units  in  a  single  layer.  In  addition,  if  we  view  computations  as 
propagating  in  a  feedforward  mode  from  the  inputs  to  the  output  unit, 
then  shallow  circuits  compute  faster.  And  the  deeper  a  circuit,  the  more 
difficult  become  the  issues  of  time  delays,  synchronization,  and  precision 
on the computations. Finally, it should be noticed that given overall responses
of a few hundred milliseconds and given the known time scales
for  synaptic  integration,  biological  circuitry  must  be  shallow,  at  least 
within  a  “module”  and  this  is  corroborated  by  anatomical  data. 

We  introduce  the  notion  of  a  support  cone,  which  is  the  set  of  nodes  that  can 
affect the behaviour of an output node. On this is built the notion of the Support
Cone Interaction (SCI) graph of an architecture, which isolates computationally
salient  features  of  an  architecture  by  explicitly  denoting  only  the  overlaps  between 
support  cones.  Finally,  by  applying  a  limit  to  the  size  of  the  support  cones,  we  create 
a  type  of  formal  constraint  that  is  powerful  enough  to  mask  off  the  difficult  issues 
involved  in  loading  deep  nets  without  interfering  with  our  theoretical  investigation 
into  issues  of  width.  We  have  used  the  term  ‘shallow  networks’  to  mean  a  family 
of  networks  whose  maximum  support  cone  size  is  limited  by  some  parameter  but 
where there is no limit on the number of nodes. This has the effect of defining a
family  of  bounded  depth  and  unbounded  width. 

We  show  that  limiting  the  size  of  the  support  cones  is  not  enough  in  itself  to  make 
loading  tractable.  Indeed,  even  when  attention  is  further  restricted  to  architectures 
whose SCI graphs are regular planar grids the problem is NP-complete. Only when
additional  constraints  are  added  that  serve  to  prohibit  the  existence  of  large  grids 
within  the  SCI  graph  are  feasible  problems  identified:  polynomial-time  loadable 
architectures  are  found  for  the  case  where  the  SCI  graph  is  of  limited  tree-width. 

6.1  Definitions 

Definition In an architecture $A = (P, V, S, R, E)$, each output node $x \in R$ has a
support cone, $\mathrm{sc}(x)$, which is the set of all nodes in $V$ that can potentially affect the
output of that node; that is, it is the set of predecessor nodes:

$\mathrm{sc}(x) = \{x\} \cup \bigcup \{\mathrm{sc}(y) : y \in \mathrm{pre}(x) \cap V\}$.

The  network  retrieval  behaviour  at  any  particular  output  node  is  determined  by 
(and  only  by)  the  functions  assigned  to  each  node  in  its  support  cone. 

Definition A support cone interaction graph (SCI graph) for an architecture is
an accounting of the interactions between support cones. It is a graph with nodes
$\{z_1, z_2, \ldots, z_r\}$ corresponding one-to-one with the output nodes, and with edges

$\{(z_i, z_j) : \mathrm{sc}(R_i) \cap \mathrm{sc}(R_j) \neq \emptyset\}$.
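
The two definitions above translate directly into code. The Python sketch below is an illustration, not the thesis's notation: it assumes the architecture is given as a dictionary `pre` mapping each node to its predecessor nodes within $V$ (external inputs omitted), that the graph is feed-forward, and the small example network is hypothetical.

```python
from itertools import combinations

def support_cone(node, pre):
    """The node itself together with the support cones of its predecessors."""
    cone = {node}
    for y in pre.get(node, ()):
        cone |= support_cone(y, pre)
    return cone

def sci_graph(outputs, pre):
    """Pairs of output nodes whose support cones share at least one node."""
    cones = {x: support_cone(x, pre) for x in outputs}
    return {
        (a, b)
        for a, b in combinations(outputs, 2)
        if cones[a] & cones[b]
    }

# Hypothetical 2-layer example: hidden nodes h1..h3 feed output nodes r1..r3.
pre = {"r1": ["h1", "h2"], "r2": ["h2", "h3"], "r3": ["h3"]}
outputs = ["r1", "r2", "r3"]
edges = sci_graph(outputs, pre)
print(sorted(edges))
```

In this example r1 and r3 share no hidden node, so the resulting SCI graph is a chain r1, r2, r3 even though the architecture itself has more nodes and edges.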

Definition A partial configuration for node $x$ is an assignment of functions to
each node in its support cone:

$F_x : \mathrm{sc}(x) \to \mathcal{F}$.

A partial configuration for a group of nodes, $X$, is an assignment of functions to all
nodes in all of its support cones:

$F_X : \bigcup_{x \in X} \mathrm{sc}(x) \to \mathcal{F}$.

Definition The support cone configuration space (sccs) for output node $x$ is the
set of all partial configurations for the support cone of $x$.

Since we are considering only binary functions of binary values for each node in
a finite graph, the size of an sccs is always finite.

Definition A family of architectures is shallow if the size of the largest sccs in
each architecture is bounded. (At first, assume it is bounded by a constant; this
will be loosened later.)

Note  that  this  limitation  has  the  effect  of  bounding  the  depth  of  a  network,  the 
maximum  fan-in  to  any  node,  and  the  number  of  different  functions  in  the  node 
function  set,  although  it  does  not  dictate  how  these  things  are  traded  off  against 
each  other. 

The complete sccs for any node in any architecture in a shallow family can be
exhaustively searched in constant time.

6.2  Grids  and  Planar  Cases


This section starts from our previous NP-completeness result on shallow architectures
and tightens it to apply to two progressively more constrained families of
shallow architectures.

The  proofs  are  extensions  of  the  one  used  for  Theorem  1  so  the  first  thing  we 
do  in  this  section  is  import  the  construction  used  there  and  make  a  minor  change. 
Note the construction in Figure 6.1a is almost identical to the one given earlier in
Figure 4.1 (page 42) except that $S = \{a, b, d, e\}$ instead of just $\{a, b\}$. The tasks
remain functionally the same, however, because input $a$ is identical to input $d$, and
$b$ is identical to input $e$.

To  make  the  next  proofs  easy  to  read,  a  pictorial  notation  for  architectures  and 
tasks  is  used  which  eliminates  excessive  formality.  In  Figure  6.1a  the  network  has 
been  depicted  on  the  page  so  that  information  flowed  across  the  plane  of  the  page, 
as  is  customary  in  the  connectionist  literature.  Figure  6.1b  shows  an  alternate  view 
of this same architecture, the plan view, which is a view “from above”. If a network
is  drawn  in  such  a  way  that  during  retrieval  the  Stimulus  originates  above  the  page, 
information  flows  into  the  page,  and  the  Response  arrives  below  the  page  then  the 
network  is  drawn  in  plan  view.  The  items  shown  in  Figures  6.1a  and  6.1b  are  also 
different  representations  of  the  same  task. 

As  in  the  proof  for  Theorem  1,  each  clause  in  the  3SAT  system  corresponds  to 
a  single  node  in  the  second  layer  of  the  constructed  architecture  with  inputs  from 
all  nodes  associated  with  its  participating  literals.  Putting  all  the  variables’  nodes 
together  with  the  clause  node  yields  something  like  what  is  shown  in  Figure  6.2.  It 
is  a  plan-view  re-representation  of  Figure  4.2. 

[Figure 6.1a: side view of the architecture, with inputs a, b, d, e and nodes w, x, y, z; diagram not reproduced.]

              a b d e    w x y z
item 1 :     (0 0 0 0,   0 0 0 0)
item 2 :     (1 1 1 1,   1 1 1 1)
item 3 :     (0 1 0 1,   * 0 1 *)

(a) The construction for each variable in the SAT system. The architecture
is drawn in the classic side view, and the 3 items in the task
are listed alongside it. Zeroes and ones are desired responses; the asterisks
are ‘don’t cares’. This construction is nearly identical to the one used in
Figure 4.1, except that $s = |S| = 4$ rather than 2.

[Figure 6.1b: plan view of the architecture and of items 1 to 3; diagrams not reproduced.]

(b) The plan view of the construction for each variable in the SAT system.
This is a different representation of what is shown in part (a) above. On the
left is the plan view of the architecture. Round nodes are first-layer nodes
and each has 2 external input connections (which are not shown). Square
nodes are second-layer nodes and have input connections from the round
nodes. All nodes have external output connections (which are not shown).
The 3 diagrams on the right are pictorial representations for the same 3
items as appear in (a) above. The letter L stands for the 2-bit input 0 0;
H stands for 1 1; and Q stands for 0 1. The zeroes and ones are desired
responses; the asterisks are ‘don’t cares’. Each character is positioned to
correspond to a node as drawn in the left diagram. First-layer nodes have
stimulus bits and required responses as well.

Figure 6.1: Plan view notation

[Figure 6.2: plan view of the architecture and its 3 items; diagrams not reproduced.]

Figure 6.2: Plan view of the composed construction. It uses notation established
in Figure 6.1. This example is for the single clause $(u_1, u_2, u_3)$. At
top left is the plan view of the architecture. Node c is a second-layer node
that is used to enforce the disjunctive semantics of the clause. Below are
the 3 items.

As before, the largest support cone in this construction has only 4 nodes in it
and the largest fan-in is only 3, so the largest sccs is of limited size. Hence this
family of constructions fits the definition of shallow networks, and this construction
is therefore sufficient to prove what is actually a looser version of Corollary 7:

Corollary  15  Loading  shallow  architectures  is  NP-complete.  □ 

Our  first  intuition  after  realizing  this  was  that  the  problem  was  difficult  because 
the  architecture  lacked  any  regular  structure — constraints  in  one  part  of  the  network 
could  immediately  impact  options  in  any  other  part  of  the  network.  Connections 
in  the  architecture  could  reach  and  thereby  propagate  constraints  from  anywhere 
to  anywhere.  To  prevent  this,  we  sought  reasonable  restrictions  to  place  on  the 
SCI  graph  so  that  constraints  generated  in  one  part  of  the  architecture  would  stay 
somewhat  local  to  the  area  in  which  they  originated.  One  such  device  was  to  require 
the  SCI  graph  to  be  planar.  Unfortunately, 

Theorem  16  Loading  shallow  planar-SCI  architectures  is  NP-complete. 

Proof¹: Note one incidental fact about the reduction used in the proof for Corollary
15: the SCI graph for an architecture in that family of constructions is identical
to the plan view of the architecture (minus directions on the edges). We will use a
similar construction in this theorem; the architecture used will have a planar plan
view and a planar SCI graph simultaneously.

The  proof  of  Corollary  15  can  be  re-employed  for  the  present  theorem  here  so 
long  as  we  can  arrange  for  no  arcs  to  cross  in  the  drawing  of  the  SCI  graph.  This 
is  done  in  the  usual  way  (see  [Lic82]) — we  show  how  to  eliminate  all  crossing  arcs 
without altering the relevant aspects of the graph. See the ‘crossover’ construct in
Figure  6.3. 

¹ This proof employs a node function set which is not linearly separable, and therefore is not
directly applicable to the conventional connectionist devices. However, there is a more elaborate
construction based on an invention by Lichtenstein [Lic82] that holds for the standard linear threshold
functions. See Appendix D.

Figure 6.3: The construction for crossovers. The architecture is shown in
plan view at left. The 6 items shown at right force $f_p(0,1) = \neg f_{p'}(0,1)$ and
$f_q(0,1) = \neg f_{q'}(0,1)$.


Let the label in a node in this diagram also denote the value emitted by that
node for input 0 1 (input 0 1 is abbreviated as a Q in the item diagram). By
comparing item 1 with item 2 deduce that $p \neq 0$ or $p' \neq 0$. By comparing item
4 with item 5 deduce that $p \neq 1$ or $p' \neq 1$. From these it follows that $p \neq p'$,
since $p$ and $p'$ can be neither both 0 nor both 1.
Similarly, by comparing item 2 with item 3, and item 5 with item 6, it follows that
$q \neq q'$. Thus $p'$ is a copy (albeit a negative copy) of $p$, and $q'$ is a (negative) copy
of $q$. The copies can be re-inverted using the construction in Figure 6.1b. Thus the
information about $p$ and $q$ ‘pass through each other’ in the plane and the techniques
for proving Corollary 15 can be used for the present theorem as well.

Since  there  are  only  a  polynomial  number  of  crossing  points  in  a  graph,  each 
one  can  be  replaced  by  the  (fixed)  amount  of  extra  construction  given  here  and  we 
still  have  a  polynomial  reduction  from  3SAT.  □ 

SCI planarity is not a tight enough constraint to escape NP-completeness. In
fact, no kind of local topology constraint on the SCI graph that is still open to 2-dimensional
expansion seems to hold much promise. Define a grid as a checkerboard
graph on nodes $x_{i,j}$ whose edges are either $(x_{i,j}, x_{i+1,j})$ or $(x_{i,j}, x_{i,j+1})$. Witness:

Theorem 17 Loading shallow grid-SCI architectures is NP-complete.

Proof: All the individual constructs in Figure 6.1b and Figure 6.3 can fit easily into
a grid topology. It remains to show how they can all be connected. For this we
need only show how to transform one of the arbitrarily shaped and arbitrarily long
arcs of Figure 6.2 into an equivalent implication while following grid lines; i.e. how
a variable can be propagated from one point on the grid to almost any other point.
Using the construction from Figure 6.1b we can make a negated copy of a variable
in a diagonally adjacent node. Using the construction from Figure 6.3 we can make
a negated copy of a variable in a node 2 places away horizontally or vertically. Using
combinations of these, we can copy a variable either positively or negatively to
any other node in the grid. See Figure 6.4 for examples. Thus any construction
for Theorem 16 can be padded with extra nodes until it becomes a grid structure. □

These grid SCI graphs have node degree 4. Loading is also NP-complete when
the SCI graph is a hexagonal array (node degree 3). Proof omitted. When node
degree  is  limited  further  to  just  2,  the  SCI  graph  becomes  a  chain  and  the  problem 
is  easy.  Proof  in  the  next  section. 

6.3  Definitions  Again 

Definition Let $\mathrm{DOM}(X)$ denote the domain of the function $X$. Two configurations
$F$ and $G$ are said to be compatible, written $F \cong G$, if they have a common extension:

$F \cong G \iff \forall v \in \mathrm{DOM}(F) \cap \mathrm{DOM}(G),\ F(v) = G(v)$

Note that a partial configuration for node $a$ is trivially compatible with a partial
configuration for node $b$ if $\mathrm{sc}(a) \cap \mathrm{sc}(b) = \emptyset$.

The union of two configurations $G$ and $H$ is defined when $G \cong H$:

$F = G \cup H \iff \mathrm{DOM}(F) = \mathrm{DOM}(G) \cup \mathrm{DOM}(H),\ F \cong G,\ F \cong H$

The usual notion of restrictions on functions is also useful:

$F = G|_A \iff \mathrm{DOM}(F) = A,\ F \cong G,\ \mathrm{DOM}(G) \supseteq A$
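
These three operations have a natural reading if a partial configuration is stored as a dictionary from node names to node functions. The Python sketch below is an illustration under that assumption; the function names and the string labels used for node functions are hypothetical.

```python
def compatible(f, g):
    """Two partial configurations agree on every node that both of them assign."""
    return all(f[v] == g[v] for v in f.keys() & g.keys())

def union(g, h):
    """Common extension of two compatible configurations."""
    if not compatible(g, h):
        raise ValueError("configurations are not compatible")
    return {**g, **h}

def restrict(g, nodes):
    """Restriction of g to a set of nodes contained in its domain."""
    if not set(nodes) <= set(g):
        raise ValueError("nodes must lie inside the domain of g")
    return {v: g[v] for v in nodes}

# Hypothetical configurations over nodes a, b, c with labelled node functions.
f = {"a": "AND", "b": "OR"}
g = {"b": "OR", "c": "AND"}
h = {"b": "AND"}

print(compatible(f, g), compatible(f, h))
print(restrict(union(f, g), {"b"}))
```

Configurations with disjoint domains are compatible vacuously, which matches the remark above about support cones that do not intersect.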

Definition A correct partial configuration, $F$, for node $x$ is a partial configuration
with the property that for any extension of $F$ to a complete configuration
$\hat{F}$, $M_{\hat{F}}$ at node $x$ agrees with the corresponding response bit over all items in the
task. A correct partial configuration for a group of nodes contains a correct partial
configuration for each node in the group.

Figure 6.4: Example task designs for propagating variables. Each diagram
shows the plan view of a 2-layer architecture. The horizontal and vertical
arrows indicate the effect of the task construct in Figure 6.3; the diagonal
arrows indicate the effect of the task construct in Figure 6.1b. These four diagrams
illustrate that a variable or its negation can be propagated throughout
a grid architecture from one first-layer node to any other first-layer
node.


Figure 6.5: An example graph with bandwidth 4. Note that the gross
structure is linear, and could be extended indefinitely without increasing
the bandwidth. The layout for the graph is given by the subscripts to the
node labels.


Definition The bandwidth of a graph measures the greatest distance by which any two
adjacent vertices in a graph must be separated when the nodes are strung out in a
straight line. Let $G$ be a graph with nodes $V(G)$ and edges $E(G)$. Let a one-to-one
function $\ell : V(G) \to \{1, 2, \ldots, |V(G)|\}$ be called a layout of $G$. Then $G$ has bandwidth
$b$ if there exists some layout, $\ell$, such that for all $(x, y) \in E(G)$, $|\ell(x) - \ell(y)| \le b$. An
example graph and its layout are given in Figure 6.5.

Definition The tree-width² of a graph is defined by [RS86] in the following way:
Let $G$ be a graph. A tree-decomposition of $G$ is a family $\{X_i : i \in I\}$ of subsets of
$V(G)$, together with a tree $T$ with $V(T) = I$, which have the following properties:

•  $\bigcup \{X_i : i \in I\} = V(G)$.

•  Every edge of $G$ has both its ends in $X_i$ for some $i \in I$.

•  For $i, j, k \in I$, if $j$ lies on the path in $T$ from $i$ to $k$ then $X_i \cap X_k \subseteq X_j$.

The width of a tree-decomposition is $\max\{|X_i| - 1 : i \in I\}$. The tree-width of $G$ is
the minimum width over all possible tree-decompositions.
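
The three properties of a tree-decomposition can be verified mechanically for a candidate decomposition. The Python sketch below is illustrative: it assumes the bags $X_i$ are given as a dictionary of sets and that `tree_edges` really does describe a tree (this is not checked); the third property is tested in the equivalent form that the bags containing any one vertex form a connected part of $T$. All names are hypothetical.

```python
def is_tree_decomposition(nodes, edges, bags, tree_edges):
    """Check the three tree-decomposition properties (tree_edges assumed to be a tree)."""
    # Property 1: the bags together cover every vertex of G.
    if set().union(*bags.values()) != set(nodes):
        return False
    # Property 2: every edge of G has both ends inside some bag.
    if not all(any({u, v} <= bag for bag in bags.values()) for u, v in edges):
        return False
    # Property 3: for each vertex, the bags holding it are connected in T.
    adj = {i: set() for i in bags}
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    for v in nodes:
        holders = {i for i, bag in bags.items() if v in bag}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in (adj[i] & holders) - seen:
                seen.add(j)
                stack.append(j)
        if seen != holders:
            return False
    return True

def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1

# A path graph a-b-c-d with a decomposition of width 1.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
good = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
bad = {1: {"a", "b"}, 2: {"c", "d"}, 3: {"b", "c"}}  # b appears in bags 1 and 3 only
tree_edges = [(1, 2), (2, 3)]

print(is_tree_decomposition(nodes, edges, good, tree_edges), width(good))
print(is_tree_decomposition(nodes, edges, bad, tree_edges))
```

The second decomposition covers all vertices and edges but violates the path property, because vertex b is missing from the bag that lies between the two bags containing it.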

As examples of this concept, trees and forests have tree-width $\le 1$, and series-parallel
graphs have tree-width $\le 2$. For $n > 1$, the complete graph $K_n$ has tree-width
$n - 1$, and the $n \times n$ rectangular grid (as in Theorem 17) has tree-width $n$.
The bandwidth of a graph is never smaller than its tree-width, but it is known that
trees (tree-width 1) have unbounded bandwidth even when their fan-in is limited
to 3 [GGJK78]. Figure 6.6 shows an example graph that has tree-width 4.
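
For very small graphs the bandwidth can be found by trying every layout. The Python sketch below is a brute-force illustration only (it enumerates all $|V|!$ layouts), and it reports the smallest $b$ for which a layout exists, which is the usual reading of the definition; the function names are assumptions.

```python
from itertools import permutations

def layout_width(edges, layout):
    """Largest separation between adjacent vertices under one layout."""
    return max(abs(layout[x] - layout[y]) for x, y in edges)

def bandwidth(nodes, edges):
    """Minimum layout width over all one-to-one layouts onto 1..|V|."""
    best = None
    for perm in permutations(nodes):
        layout = {v: i + 1 for i, v in enumerate(perm)}
        w = layout_width(edges, layout)
        best = w if best is None else min(best, w)
    return best

nodes = ["a", "b", "c", "d"]
path = [("a", "b"), ("b", "c"), ("c", "d")]
cycle = path + [("d", "a")]

print(bandwidth(nodes, path), bandwidth(nodes, cycle))
```

A path can be laid out with every edge joining neighbours, while closing it into a cycle forces at least one edge to span two positions.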

² The armwidth of a graph is a generalization of bandwidth which we developed and defined
in terms of a pebbling game or a vertex-elimination procedure. During preparation
of this document we discovered that the notion has been independently developed by others
[ACP87, AP88, WHL85, CK87]. The treatment given by Robertson and Seymour [RS86] is more
appealing than our definition for the purposes of the proof below so we use their notation and
name for it. Our definition can be found in a technical report [Jud88a] proving its equivalence to
“embeddings in partial k-trees” and “tree-width”.

Figure 6.6: An example graph with tree-width 4. Its bandwidth is 16.
Note that the gross structure is a tree, but each arm in this tree is not a
simple path graph as a true tree would have, but is a ‘fatter’ structure. Each
of these fat arms, taken independently, is a graph with bandwidth 4.

6.4  Tree-Width  Constraints

The theorems above deal with constrained families of architectures and assert that
the loading problem is intractable for those families. This section examines a different
type of constraint and reports polynomial-time algorithms for it, which we
interpret as tractable. We begin with an example family of networks called columnar
lines. These architectures are described graphically in Figure 6.7a. They are of
some fixed depth (4 in the example shown) and of unbounded width, so they qualify
as a shallow family. Their fish-net pattern of connectivity gives rise to the family
of SCI graphs depicted in Figure 6.7b. Regardless of the width of the architecture,
its SCI graph has a bandwidth (and tree-width) of 3 (one less than the depth of the
net).

Observation: Columnar line architectures can be loaded in polynomial time.

Proof sketch: Create a graph with a collection $h_1, h_2, h_3, \ldots$ of sets of nodes,
where a node $h_k^i$ stands for the $i$th correct partial configuration for the support
cone of the $k$th output node. Then add edges $(h_k^i, h_{k+1}^j)$ whenever $h_k^i \cong h_{k+1}^j$. A
solution to the loading problem corresponds to a connected path from some member
of $h_1$ to some member of $h_2$ to some member of $h_3$ and so on to the end. Finding
such a path requires only polynomial time. □
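
The path search in this proof sketch can be written as a layer-by-layer reachability pass. The Python sketch below is an illustration under assumptions that go beyond the text: the correct partial configurations for each output node are taken as already computed and stored as dictionaries, only consecutive layers are checked for compatibility (as in the sketch above, which presumes that overlaps between more distant cones are already covered by the cones in between), and the names `load_chain` and `candidates` are hypothetical.

```python
def compatible(f, g):
    """Two partial configurations agree on every node that both of them assign."""
    return all(f[v] == g[v] for v in f.keys() & g.keys())

def load_chain(candidates):
    """candidates[k] lists the correct partial configurations for output k.

    Returns one configuration per output such that consecutive choices are
    compatible, or None if no such chain exists.
    """
    # back[k][j] = index of a reachable, compatible choice in layer k-1.
    back = [{i: None for i in range(len(candidates[0]))}]
    for k in range(1, len(candidates)):
        layer = {}
        for j, h in enumerate(candidates[k]):
            for i in back[-1]:
                if compatible(candidates[k - 1][i], h):
                    layer[j] = i
                    break
        back.append(layer)
    if not back[-1]:
        return None
    # Walk the back-pointers from any reachable member of the last layer.
    j = next(iter(back[-1]))
    chain = []
    for k in range(len(candidates) - 1, -1, -1):
        chain.append(candidates[k][j])
        j = back[k][j]
    return chain[::-1]

# Hypothetical candidates for three output nodes with overlapping cones.
candidates = [
    [{"a": 0, "b": 0}, {"a": 1, "b": 1}],
    [{"b": 1, "c": 0}],
    [{"c": 0, "d": 1}, {"c": 1, "d": 1}],
]
print(load_chain(candidates))
```

The work is bounded by the sum over $k$ of $|h_k| \cdot |h_{k+1}|$ compatibility tests, and each $|h_k|$ is bounded by a constant in a shallow family.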


The  next  theorem  generalizes  the  previous  observation. 

Theorem 18 Loading shallow architectures whose SCI graphs are of limited tree-width
can be accomplished in polynomial time, provided that a tree-decomposition is
given that exhibits the required width.

Proof: Let $T$ and $\{X_i : i \in V(T)\}$ be the tree-decomposition of the SCI graph.
Let $\tau, \tau_1, \tau_2, \ldots$ stand for subtrees of $T$. Let $\mathrm{region}(\tau) = \bigcup \{X_i : i \in V(\tau)\}$. Let
the set of all correct partial configurations for the group of architectural output
nodes corresponding to a group, $X$, of nodes in the SCI graph be denoted $\mathrm{scpc}[X]$.
Any member of $\mathrm{scpc}[\mathrm{region}(T)]$ is a solution configuration. Let the root node of a
subtree $\tau$ be denoted $\mathrm{root}(\tau)$.

(b) The SCI graph for the columnar line architecture of (a) above. Each node
corresponds to an output node of the architecture. Arcs occur wherever their
associated support cones overlap. Regardless of the length of this graph, it
has bandwidth 3.

Figure 6.7: Columnar line architectures and their SCI graphs.

The following recursive dynamic programming subroutine has access to an architecture,
its SCI graph, and the tree-decomposition for the graph, and it takes
some subtree of $T$ as an argument:

SOLVE($\tau$):

    for every immediate subtree $\tau_j$ of $\tau$
        calculate $S_j \leftarrow$ SOLVE($\tau_j$)
    calculate $S' \leftarrow \mathrm{scpc}[X_{\mathrm{root}(\tau)}]$
    calculate $S \leftarrow \{F : F \in S',\ \forall j\ \exists F_j \in S_j,\ F \cong F_j\}$
    return $S$

We claim for any given subtree, $\tau$, that every member of the returned set
$S$ = SOLVE($\tau$) has an extension that is correct for all of $\tau$; and that all correct
configurations for $\tau$ must be extensions of some member of $S$.

Claim: $\exists F \in$ SOLVE($\tau$) $\iff$ $\exists F' \in scpc[region(\tau)]$ with $F = F'|_{X_{root(\tau)}}$.

Proof, by induction on the height of $\tau$. For the basis case where $\tau$ is a single leaf
node, $\ell$, SOLVE returns $S = \hat{S} = scpc[X_\ell]$ so the claim is true. For the inductive step,
assume the claim true for any subtrees $\tau_1, \tau_2, \tau_3, \ldots$ and consider a deeper subtree,
$\tau^+$, consisting of a root node, $h$, and subtrees $\tau_1, \tau_2, \tau_3, \ldots$ immediately below it. Say
$F \in$ SOLVE($\tau^+$). Then $F \in scpc[X_h]$, and $\forall j\ \exists H_j \in$ SOLVE($\tau_j$) such that $F = H_j$, by
the calculation of $S$. So $\exists F_j \in scpc[region(\tau_j)]$ and $H_j = F_j|_{X_{root(\tau_j)}}$ by the inductive
assumption.

Now by definition of the tree-decomposition, $DOM(H_j) \supseteq DOM(F_j) \cap DOM(F)$.
So $\exists F_j^+ = F_j \cup F \in scpc[region(\tau_j) \cup X_h]$. It remains to show that all the $F_j^+$ are
mutually compatible; this must be so because a path from one subtree to any other
must pass through the root $h$. Hence $DOM(F_i^+) \cap DOM(F_j^+) \subseteq DOM(F)$ for any $i, j$,
and $\exists F^+ = F \cup \bigcup_j \{F_j\} \in scpc[region(\tau^+)]$. This proves the $\Rightarrow$ direction.

Conversely, if $\exists F^+ \in scpc[region(\tau^+)]$ then the (exhaustive) algorithm must find
$F = F^+|_{X_h}$. This completes the $\Leftarrow$ direction and proves the claim.

To determine if a solution configuration exists for the whole network, run this
algorithm:

    Pick any node in $T$ to be the root
    Calculate $S \leftarrow$ SOLVE($T$)
    If $S = \emptyset$ then reject else accept

Any  member  of  S  indicates  the  presence  of  a  solution  configuration  so  this 
algorithm  accepts  if  and  only  if  the  task  is  performable. 

Finally, we must show that $g(n)$, the running time of this algorithm, is polynomial.
Consider first the running time, $g_1(n)$, of the non-recursive parts of the
subroutine SOLVE. This is $O(|scpc[X]|)$, which is exponential in the size of $X$, but
since the size of $X$ is limited, execution time is also limited. So $g_1 = O(1)$. There
are a polynomial number of nodes in $T$, bounded by $\binom{n}{k} = O(n^k)$, if not
by something linear. The algorithm invokes SOLVE once per node in $T$, so total
time is $g(n) = O(n^k) \times O(1)$. □
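The recursion used in this proof can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's own implementation: the tree-decomposition is assumed to be given as a dictionary from each tree node to its children, the correct partial configurations for each set $X_i$ are assumed to be precomputed, and the names `solve`, `loadable`, `children`, and `scpc` are hypothetical.

```python
from typing import Dict, Hashable, List

# Hypothetical encoding: a partial configuration maps node names to function ids.
Config = Dict[str, int]


def compatible(f: Config, g: Config) -> bool:
    """Two partial configurations are compatible if they agree on every shared node."""
    return all(g[node] == value for node, value in f.items() if node in g)


def solve(root: Hashable,
          children: Dict[Hashable, List[Hashable]],
          scpc: Dict[Hashable, List[Config]]) -> List[Config]:
    """Return the members of scpc[root] that are compatible with at least one
    surviving configuration of every immediate subtree."""
    child_sets = [solve(child, children, scpc) for child in children.get(root, [])]
    return [f for f in scpc[root]
            if all(any(compatible(f, h) for h in survivors) for survivors in child_sets)]


def loadable(root: Hashable,
             children: Dict[Hashable, List[Hashable]],
             scpc: Dict[Hashable, List[Config]]) -> bool:
    """Accept if and only if some configuration survives at the root."""
    return len(solve(root, children, scpc)) > 0
```

The sketch visits each tree node once and does work bounded by the product of the sizes of the configuration sets it compares, which mirrors the running-time argument above. Its correctness leans on the tree-decomposition property exactly as the proof does.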

Note that this theorem holds even if we loosen the definition of shallow architectures
so that the largest sccs size is polynomial in $n$ (as opposed to being a
constant). In such a case, $g_1(n)$ and $g(n) = O(n \times g_1(n))$ are still polynomial.

Theorem 18 refers to ‘limited’ tree-width and was worded to imply “limited by a
constant”, but this is over-strong. Consider a family of architectures characterized
only by a growth function $G(n)$ for the tree-width of its SCI graph. The theorem
is worded for the case $G(n) = O(1)$, but it holds true for the case $G(n) = O(\log n)$
because $g_1$ is only exponential in $\log n$, which means that it is polynomial. So $g$ is
still polynomial as well.
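The step from "exponential in $\log n$" to "polynomial" can be written out explicitly. The following is a sketch under assumptions that are ours rather than the thesis's: each support cone has at most $s$ configurations (a constant for shallow architectures), each set $X_i$ of the tree-decomposition holds at most $G(n) + 1$ nodes, and $G(n) \le c \log_2 n$ for some constant $c$.

```latex
\[
  |scpc[X_i]| \;\le\; s^{|X_i|} \;\le\; s^{\,c \log_2 n + 1}
  \;=\; s \cdot n^{\,c \log_2 s},
  \qquad\text{so}\qquad
  g(n) \;=\; O\bigl(n \times g_1(n)\bigr) \;=\; O\bigl(n^{\,1 + c \log_2 s}\bigr).
\]
```

The exponent $1 + c \log_2 s$ is a constant, so the bound is polynomial in $n$, although the degree grows with both $c$ and $s$.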

Now remember that when $G(n) = O(n)$ the loading problem is NP-complete
(since this is a non-constraint: the tree-width of any graph of $n$ nodes is at most
$n$). These bounds leave a gap between $O(\log n)$ and $O(n)$ which can be narrowed
somewhat:

Theorem 19 For shallow architecture families with a growth function for the tree-width
of their SCI graph $G(n) = n^{\Omega(1)} = n^{\epsilon}$, loading is NP-complete.

Proof: Take an arbitrary instance of 3SAT and perform the reduction as in Theorem 15.
Consider the graph defined on the 3SAT instance which has a node for
every variable, a node for every clause, and edges connecting variable nodes to all
the clause nodes they participate in. If this graph is of size $n$ and tree-width $w$ (and
$w < n$ always), then the constructed instance of loading will have size $O(n)$ and
tree-width $w$. Now pad the construction with enough isolated nodes to bring it up to
size $n' = G^{-1}(n)$. This will not change the tree-width of the loading instance but it
will ensure that $w < G(n')$, thus satisfying the criterion for membership in the family.
Since $G$ is polynomial, $G^{-1}$ is also polynomial. No matter how small $\epsilon$ is, as long
as it is greater than 0 there is a polynomial-sized reduction from SAT to loading. □
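A small worked instance may make the padding step concrete. The numbers below are illustrative assumptions only, chosen so the arithmetic is easy to check.

```latex
\[
  G(n) = n^{\epsilon}
  \;\Longrightarrow\;
  G^{-1}(n) = n^{1/\epsilon};
  \qquad
  \epsilon = \tfrac{1}{2},\; n = 100
  \;\Longrightarrow\;
  n' = 100^{2} = 10^{4},
  \quad
  G(n') = \sqrt{10^{4}} = 100 > w .
\]
```

The padded instance is larger than the original by a polynomial factor, here $n' = n^{2}$, and the degree $1/\epsilon$ of that polynomial grows as $\epsilon$ shrinks but stays finite for any fixed $\epsilon > 0$.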

This  narrowed  window  of  bounds  hangs  on  the  tree-width  constraint  alone  and 
is  therefore  common  to  many  combinatorial  search  problems,  not  just  the  loading 
problem. 

Theorem  18  stipulates  that  the  tree-decomposition  of  the  SCI  graph  must  be 
given  as  input  to  the  problem  because  in  general  determining  minimum  tree-width 
is an NP-complete problem in itself [ACP87]. This is probably not a problem for
connectionists  because  the  network  design  methodologies  we  hope  to  find  would 
presumably  be  amenable  to  easy  a  priori  structural  analysis.  (Assume  the  network 
does  not  change  its  connectivity  during  use.)  However,  if  this  theorem  were  to  be 
exploited  in  a  direct  implementation  it  does  imply  that  the  nodal  learning  rules 
would  have  to  be  aware  of  the  structure  of  the  SCI  graph,  i.e.  knowledge  of  the 
tree-decomposition  would  have  to  be  ‘wired  in’  to  the  network  somehow. 
6.5 Additional Comments

The algorithms given above are for purposes of demonstrating the polynomial-time
complexity of the problems. They are not intended to have any neural plausibility.
The running time constants could be markedly improved in the algorithm given,
but note that the running time is linear in the size of the architecture and in task
size. This problem can therefore be added to the list [MS81,CES81] of NP-complete
problems that become easier with diminishing bandwidth. That characterization
may now be obsolete, though, because it seems all of those results can be re-cast in
terms of the weaker notion of tree-width.

By limiting the size of the sccs in all theorems above, we have finessed the
whole issue of how the loading problem gets more difficult with depth. This trick
has allowed us to focus on the issues arising from expansion of the width of an
architecture. But putting individual limits on the sccs size and on the tree-width
is unnecessarily strong. The real constraint required by the proof of Theorem 18 is
only that the scpc for any tree-decomposition set, $X_i$, be calculable in a polynomial
amount of time.

We have ignored the possibility of an efficient search for correct partial configurations
and have chosen here to enumerate all possibilities. We have dismissed this
particular inquiry as being a “depth issue”.

We have studied fixed-depth architectures partly because they are easy enough
to analyze, but they are also of interest because of a possible correspondence with
cortical structures. Certain parts of the brain (e.g. visual cortex [HW79]) are quite
shallow compared to their great width, and information flow is
predominantly unidirectional along the shallow axis. Connections are more or less
localized in 3D space surrounding a neuron. Of course real cortical structures are
complicated by many connections and other specifics not modelled here, but we feel
that the process of developing a theory of how such structures work could benefit
by analyzing a few judicious constraints at a time. The constraints chosen here
are an approximation to what seem to be the major computational aspects of some
cortical structures.
Chapter 7


MEMORIZATION  AND 
GENERALIZATION 


One would hesitate to use neural networks just to memorize and store data because
it is probably not economical at all; there are many other engineering techniques
that are strong competitors for that honour. But one common motivation for studying
neural networks is that they can generalize, and thus perform something of great
value beyond mere storage.

The  issue  of  generalization  does  not  seem  to  be  the  primary  concern  of  this 
thesis.  However,  we  state  in  this  chapter  that  before  the  issue  of  generalization 
can  be  addressed,  the  memorization  problem  must  first  be  solved;  hence  our  results 
about  memorization  have  a  direct  bearing  on  the  other  issue. 

Following,  we  give  several  statements  of  the  same  idea;  the  reader  who  accepts 
any  one  of  the  arguments  might  skip  the  others. 

Statement 1 We have shown that a network cannot always remember all the
items that it has seen. One should therefore not expect it to always be able to
extend its knowledge to things it has not seen.

Statement 2 When specifying what is meant by ‘generalization’, one could require
that the chosen function agree in all places with the given data, or one might
allow some degree of deviation from the given data. In the case where the allowed
generalisations must all be consistent with the given task, our results are directly
applicable, showing that consistency is, in general, too hard to reliably achieve. The
business of finding regularities in data and generalizing from them depends totally
on the embedded problem of simply remembering data.

Statement 3 In the case where the application could tolerate ‘generalizations’
that need not be completely consistent with the given data, our results are sometimes
less directly relevant. But Corollary 13 is strong enough to apply to some such
situations: Even if you allow a loading system to alter the responses on anything
less than 1/3 of the items (allowing the system to select which items and what to
change them to), it is still NP-complete to achieve consistency with the rest.

Statement  4  When  one  is  given  a  small  sampling  of  items  and  asked  to  find  a 
configuration  that  is  consistent  with  those  items,  there  are  typically  a  vast  number 
of  candidate  configurations.  The  notion  of  “good”  generalization  corresponds  to 
making  an  “appropriate”  selection  from  amongst  this  field  of  options.  The  definition 
of “appropriate” is of course going begging here. But our NP-completeness theorems
indicate  that  it  is  too  difficult  to  identify  even  a  single  configuration  from  this  field  of 
candidates.  Hence  the  definition  of  “appropriate”  is  of  little  concern.  Regardless  of 
how  one  might  prefer  to  define  generalization,  consistency  is  the  nub  of  the  problem. 

Statement  5  A  system  that  learns  and  generalizes  from  what  it  learns  is  often 
treated  in  a  two-phase  experimental  paradigm.  The  first  phase  is  called  the  training 
phase,  and  in  it  some  subset  of  items  is  selected  (by  the  experimenter)  from  a  task 
and  presented  to  the  system.  The  second  phase  is  called  the  testing  phase.  In  it 
some  subset  of  items  (presumably  disjoint  from  the  training  set)  is  selected  from 
the task and the system is asked to induce what the responses should be. Of course
the performance of the system will be sensitive to how representative the training
set is of the overall task and how complete it is. Also, it will be sensitive to how
representative the testing set is of the overall task. Amongst the community using
this paradigm there is a widely-held meta-theorem which says that the better a
system does on the training set, the better it will do on the test set. And this
observation would have us concentrate on solving the memorization problem; poor
performance in memorization bodes for poor performance in generalization.

Statement 6 The representativeness of the training set and the representativeness
of the testing set are very subjective quantities. Hence the two-phase experimental
paradigm can give erratic and non-rigorous results. Valiant’s definition
of learnability has an ingenious mechanism for handling all of these quantities in a
standard mathematical way that utilizes a probabilistic criterion of success in learning
and generalization. See Figure 3.2, page 26. Rather than arbitrarily choosing
a training set in advance (which is open to many vagaries and biases), he selects
a training set by randomly choosing items according to some unknown a priori
distribution over them. Hence the make-up of the training set is objective, albeit
probabilistic, and it is biased only by the distribution. He also selects a testing
set in the same way, for the same reason. And since the same distribution is used
in both cases, the training set is an unbiased sampling of the testing set (and vice
versa). Furthermore, the size of the training set is not determined by the experimenter
either; it becomes a decision of the algorithm how many items to sample.
The  testing  set  is  in  effect  the  whole  task,  but  each  item  is  weighted  by  its  relative 
probability. 

The definition then requires that the system usually be correct for most items
in the test set. ‘Usually’ is defined by a confidence probability parameter, $\delta$, and
‘most’ is defined by an accuracy probability parameter, $\epsilon$. The criterion of success
requires that the learning algorithm terminate in time that is polynomial in $1/\delta$ and
$1/\epsilon$ for any given $\delta, \epsilon > 0$. Obviously if the algorithm terminates in polynomial time,
then it can afford to sample only a polynomial number of items, but the system
is granted a bigger budget of time whenever more confidence or more accuracy is
desired.

$\mathcal{A}$ is a design class of architectures.
$A$ is an architecture.
$p$ is a polynomial.
$B$ is an algorithm.
$\epsilon, \delta$ are probabilities.
$F, G$ are configurations for $A$.
$M_F^A$ is the behaviour of $A$ when configured with $F$.
$T \subseteq \{(\sigma, \rho) \mid M_F^A(\sigma) = \rho\}$ is a task.
$D$ is a probability distribution over task items.

$(\exists p, B)$ such that $(\forall A \in \mathcal{A})(\forall F$ for $A)$
$(\forall T \subseteq M_F^A)(\forall D$ over $T)(\forall \epsilon, \delta > 0)$
$B$ halts in time $p(|A|, |T|, 1/\epsilon, 1/\delta)$ with
output $G$ that with probability $> 1 - \delta$
has property f

“$\mathcal{A}$ can generalize”

Figure 7.1: A definition of generalization in networks

This definition has been examined by Blumer et al. [BEHW87] and they prove
that the probability that all consistent hypotheses have error at most $\epsilon$ is larger than
$1 - (1 - \epsilon)^m r$, where $m$ is the number of samples and $r$ is the number of hypotheses
in the space of all hypotheses. This again is an endorsement of consistency: if you
can be true to the training set, you will be true to the testing set.
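To see what this bound asks for in practice, one can solve it for the number of samples. The sketch below is our own illustration, assuming the bound has the form $1 - r(1-\epsilon)^m$ and using the standard inequality $(1-\epsilon)^m \le e^{-\epsilon m}$; the function names are hypothetical.

```python
import math


def consistency_bound(m: int, r: int, epsilon: float) -> float:
    """Lower bound 1 - r * (1 - epsilon)**m on the probability that every
    hypothesis consistent with m samples has error at most epsilon."""
    return 1.0 - r * (1.0 - epsilon) ** m


def samples_needed(r: int, epsilon: float, delta: float) -> int:
    """A sufficient sample count: m >= (ln r + ln(1/delta)) / epsilon.

    Follows from (1 - epsilon)**m <= exp(-epsilon * m), so that
    r * (1 - epsilon)**m <= delta whenever m meets this threshold.
    """
    if r < 1 or not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("need r >= 1 and 0 < epsilon, delta < 1")
    return math.ceil((math.log(r) + math.log(1.0 / delta)) / epsilon)
```

For example, `samples_needed(1000, 0.1, 0.05)` gives a sample count for which `consistency_bound` is at least 0.95. The count grows only logarithmically in the number of hypotheses, which is why consistency on a modest sample carries so much weight.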
Statement 7 We have utilized Valiant’s $\epsilon, \delta$ ploy in our definition of generalization
for networks as given in Figure 7.1. Using this definition, we ask if the class, $\mathcal{A}$, of
all networks can ‘generalize’:

Corollary  20  Networks  cannot  generalize. 

Proof: Use the same construction as in Theorem 1. Set $\epsilon < 1/3$, and let $D$ be
uniform over the three items. Then with probability $> 1 - \delta$ the algorithm, $B$,
must find a configuration that is consistent with all three of the items. This implies
that $B$ will be a probabilistic polynomial-time algorithm for 3SAT, which implies
3SAT $\in$ RP.¹ Assuming RP $\neq$ NP,² this is impossible. □
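The reason an accuracy of $\epsilon < 1/3$ forces full consistency is a one-line calculation. As a sketch, write $G$ for the configuration output by the algorithm and suppose it is wrong on $k$ of the three equally weighted items.

```latex
\[
  \mathrm{error}_D(G) \;=\; \frac{k}{3},
  \qquad
  k \ge 1 \;\Longrightarrow\; \mathrm{error}_D(G) \;\ge\; \frac{1}{3} \;>\; \epsilon ,
\]
```

so the accuracy requirement can be met only with $k = 0$, that is, by a configuration consistent with all three items.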


Statement 8 We are not claiming that useful generalization can never be performed
by connectionist networks. But what we do claim is that the consistency
problem is a prior consideration. If simple consistency cannot be achieved when required
(at least for the target family of tasks), then it is premature to worry about
making predictions for unseen stimuli.


¹ The complexity class RP is defined as the set of decision problems that have algorithms that
run in time polynomial in $n$ and $1/\delta$ and which will always return 0 if the answer is NO, and if the
answer is YES will return 1 with probability $> 1 - \delta$.

² Assuming RP $\neq$ NP is good for your theorebellum.
Chapter 8

CONCLUSIONS 


8.1  Lessons  Drawn  from  Current  Results 

Loading  is  hard:  The  job  of  simply  remembering  associated  pairs  of  strings 
requires  only  linear  time  in  a  von  Neumann  machine,  but  we  have  shown  that  a 
large-scale  version  of  this  trivial  problem  can  become  very  difficult  if  it  must  be 
achieved  in  a  given  non-recurrent  network.  Hence  there  is  reason  for  connectionist 
research  to  find  out  why  this  phenomenon  occurs  and  how  to  avoid  it.  The  scale-up 
problem  will  not  be  solved  without  a  deeper  understanding  of  the  issues  involved 
in  learning,  and  without  a  narrower  definition  of  what  kinds  of  learning  we  want  to 
achieve. 

Neural networks have been touted as having more natural and more powerful
learning abilities than traditional AI learning systems. Certainly, there is some
appeal and basis for the argument. It is more comfortable to believe that a small
adjustment to a few weights in a net will (a) create a new behaviour that is substantially
like the old behaviour, and (b) quite possibly improve the behaviour. In
contrast, a small adjustment to a few bits in the program of a Turing Machine will
(a) usually produce radically different behaviour, and (b) often produce a totally
useless behaviour. Whether this argument is fair or not, this thesis has demonstrated
that before we can harness this quality of gentle adaptations, we still need
to know a lot more about the network model, how to design it, how to program it,
and what applications to put it to.

Issues of Node Function Sets: A significant set of side questions arose during
our research regarding the justification and appropriateness of the type of node
functions typically used in the connectionist literature:¹

• Is there any support for the choice of node function sets that use linear summing
techniques? i.e. Why use LSFns? LLFns?

•  Can  learning  theory  speak  to  the  issue? 

•  Are  some  node  function  sets  easier  to  learn  with  than  others? 

We have good evidence that the difficulty of the loading problem is independent of
the type of functions that each node can perform: for all reasonable sets,
our results hold regardless of the choice of node function set. Hence
we conclude that nothing in our work either supports or detracts from the use of
LSFns or LLFns.

As mentioned near the end of Chapter 4, Blum and Rivest [BR88] have found
a different proof of the NP-completeness of $\mathrm{Perf}_{\mathrm{LSFns}}$ and their argument depends
directly on having to linearly separate many points in s-space. If the three nodes
in their construction were using AOFns, LUFns, or some other functions instead of
LSFns, the proof would not hold. Hence the only evidence from learning complexity
uncovered so far speaks somewhat against linear summing! However this is quite a
weak argument as it stands; there is no need to seriously question linear sums yet.

Basically, what we can conclude from our results being independent of the node
function set is that the complexity of loading does not derive from the node function
set.

¹ We thank I. Aleksander [Ale84] for resolutely avoiding the dominant viewpoint on what constitutes
a good node function set, thus prompting our questions on the topic.
What it does derive from is the connectivity patterns of the network. This much
is clear. Notwithstanding this, there is a great deal of research effort being put into
understanding linear threshold functions and also into studies of linear threshold
networks. For instance Minsky and Papert [MP72] treat linear threshold devices
as a primary issue. This is a reasonable strategy for investigating the power of small
networks; indeed in the case of tiny networks (i.e. one node or one layer), the role
of the node function set is ascendant because it has an overwhelming effect on what
can be performed by the net. In large or deep networks, the role of the node function
set in determining computational power fades quickly and is replaced by issues like
the size, depth, and connectivity of the net. We suggest, therefore, that studies of
linear threshold devices, if not totally irrelevant to learnability, are at least guilty
of placing undue importance on an issue that will only help settle minor issues. See
also [MP72, footnote page 165].

Generalization: Although generalization properties are exciting possibilities for
neural networks, we have argued in several ways that the simple issue of consistency
is a central and prior consideration. Good generalization requires good memorization.

Design  Constraints:  We  have  shown  that  loading  can  be  hard,  that  it  can  be 
easy,  and  that  one  of  the  things  this  depends  on  is  the  family  of  architectures  being 
loaded.  The  theorems  serve  as  warnings  and  as  guideposts  to  better  designs.  When 
the  SCI  of  a  shallow  architecture  has  limited  tree-width,  loading  is  tractable,  but 
this  constraint  may  not  yield  useful  families  of  networks.  Less  constrained  families 
that we looked at (e.g. grid SCI graphs) have NP-complete loading problems.
These results are some evidence that architectural constraints alone will not serve
as a useful exit from NP-completeness. Other aspects of the problem will need to
be  changed,  possibly  in  conjunction  with  architectural  constraints. 




Methodology: We have outlined a wide range of questions regarding narrowed
or altered models of the connectionist learning goal. The particular subcases considered
here are merely a few of the myriad avenues open for research. The tool of
NP-completeness can direct the search for good learning rules and/or easily-loaded
architectures and/or easily-loaded tasks without requiring extensive simulations.
By carefully refining definitions and searching for a more complete description of
the boundary between solvable and infeasible problems, a more useful theory will
develop that will have applications to the design of many kinds of network machines.

8.2  Contributions  of  this  Thesis 

We have focussed on the scale-up problem in supervised learning as an area requiring
major effort and applied standard tools of complexity theory to try to understand
it.

The first major contribution made in this research program is to have identified
and formalized the basic computational problem underlying the connectionist
learning problem. There are four little parts that went into its construction. (1)
The  5-step  cycle  of  classical  connectionist  learning  (see  Section  2.5)  which  took  a 
stimulus  and  response  and  produced  a  weight  change,  was  condensed  into  taking 
a  task  into  a  configuration  of  weights.  (2)  The  notion  of  a  node  function  set  was 
generalized  so  that  we  stopped  referring  to  a  configuration  of  weights  and  simply 
referred  to  a  configuration.  (3)  The  distributed  nature  of  the  classical  algorithms 
was  removed  and  supplanted  by  serial  computation.  (4)  The  architecture  was  made 
into  an  explicit  input.  Altogether,  this  gave  us  the  form  of  the  loading  problem  as 
a  function  from  (architecture,  task)  pairs  to  configurations. 
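Read as a type signature, the loading problem takes an (architecture, task) pair and returns a configuration, or reports that none exists. The sketch below is purely illustrative and rests on assumptions of our own: nodes compute arbitrary Boolean functions stored as truth tables, the architecture is a dictionary listing each node's inputs in topological order, and the search is brute force. The names `Architecture`, `Task`, `Configuration`, `run`, and `load` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# Hypothetical encodings, chosen only for illustration.
Architecture = Dict[str, List[str]]                  # node -> its input names, in topological order
Task = List[Tuple[Dict[str, int], Dict[str, int]]]   # (stimulus bits, required response bits)
Configuration = Dict[str, Tuple[int, ...]]           # node -> truth table over its inputs


def run(arch: Architecture, config: Configuration, stimulus: Dict[str, int]) -> Dict[str, int]:
    """Evaluate every node of the configured network on one stimulus."""
    values = dict(stimulus)
    for node, inputs in arch.items():
        index = 0
        for name in inputs:
            index = 2 * index + values[name]   # read the inputs as a binary number
        values[node] = config[node][index]
    return values


def load(arch: Architecture, task: Task) -> Optional[Configuration]:
    """Brute-force loading: return a configuration that performs the task, or None."""
    nodes = list(arch)
    tables = [list(product((0, 1), repeat=2 ** len(arch[n]))) for n in nodes]
    for choice in product(*tables):
        config = dict(zip(nodes, choice))
        if all(run(arch, config, stim)[out] == bit
               for stim, resp in task for out, bit in resp.items()):
            return config
    return None
```

The search space is the product of all truth tables, so this sketch takes exponential time in general. It is meant only to pin down the input and output of the problem, not to suggest an efficient method.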

The  computational  question  has  been  demonstrated  here  to  be  of  broad  general 
value  in  finding  design  constraints  for  neural  networks.  There  is  a  very  large  class 
of related questions that follow the basic formulation but particularize it by stating
restrictive definitions on the various aspects of the problem. Thus we feel that the
development of the formulation itself represents a major component of this thesis.

We  have  recognized  the  relationship  between  our  model  of  loading  and  other 
models  of  learning  given  by  Valiant  and  Gold. 

In  their  book  Perceptrons  [MP72],  Minsky  and  Papert  lament  the  lack  of  an 
effective  procedure  for  loading  networks  and  express  a  hope  that  “some  profound 
reason  for  the  failure  to  produce  an  interesting  learning  theorem  for  the  multilayered 
machine  will  be  found.”  Our  results  supply  such  a  reason,  and  the  proofs  of  our 
theorems  stand  as  opening  insights  into  the  reasons  why  the  loading  problem  is  so 
difficult.  The  fact  that  network  learning  is  NP-complete  may  not  be  surprising  in 
itself,  but  its  proof  is  still  a  valuable  contribution  on  its  own. 

In penetrating the issues surrounding expansion of network width, we have developed
the notions of shallowness and SCI graphs and demonstrated their usefulness
by identifying some polynomial time problems and some closely related problems
that are NP-complete.

We have also raised the question of how we might justify linear sum functions
in networks or find another node function set that might be more appropriate for
learning.

In  attempting  to  answer  that  question,  we  have  found  good  evidence  that  the 
difficulty  of  the  loading  problem  does  not  derive  from  features  of  the  node  function 
set.  In  fact  our  theorems  find  no  evidence  in  support  of  any  node  function  set  over 
any  other  and  we  argue  that  good  research  strategy  ignores  the  particularities  of 
any  one  node  function  set  and  concentrates  instead  on  higher-level  issues. 

We have illustrated why the development of a theory of learning in networks
would directly contribute to the otherwise black art of network design. We have
found numerous avenues to follow in order to study the effect of scale-up on the
learning issue, and to thereby derive principles that contribute to a methodology of
network design.

Lastly, we have developed the notion of armwidth (aka tree-width and partial
k-trees) to characterise an important constraint on graphs that yields polynomial
subcases for otherwise NP-complete problems. This idea is tangential to the present
document but deserves wide exposure to graph theorists and algorithm designers.
Indeed its importance is underscored by the fact that other researchers have independently
discovered the same notion and papers on the topic are now appearing
in the literature.

8.3  Future  Work 

The  obvious  extensions  to  this  work  include  refining  the  classes  of  architectures 
considered  and  the  classes  of  tasks  considered,  so  as  to  more  closely  understand  the 
relationship  between  networks  and  what  they  can  learn.  Some  specific  directions 
are  outlined  in  the  following  subsections. 

8.3.1  Task  Constraints 

Although  this  study  has  focussed  on  what  it  expressed  as  architectural  design  issues, 
these  results  could  just  as  easily  have  been  expressed  as  task  design  issues,  and  in 
fact  they  are  both.  Whenever  we  found  poly-time  loadable  architectures  we  were 
also  implicitly  identifying  poly-time  loadable  tasks,  since  the  class  of  tasks  that  such 
architectures  were  capable  of  loading  was  given  as  the  set  of  all  tasks  performable 
by  that  architecture.  Hence  the  investigation  has  had  dual  purpose  throughout. 
However, it is a limitation of this work that we have generally inquired only about
the ability of a network to load all of its performable tasks instead of asking about its
ability to load some useful subset of its performable tasks. This should be explored
further.

To pursue such questions, one needs to identify interesting classes of tasks and
find useful formal definitions for them. In Section 5.2 we used a teacher/learner
formalism to constrain the class of tasks a network might be asked to load. This
technique was used only in conjunction with differing node function sets but it is
also useful in other contexts. For example, it might be useful to be able to describe
exactly what tasks an architecture is capable of learning by referring to a teacher
network whose tasks it can easily load. As before, we denote this by writing the
parameters for describing the teacher to the left of Perf and those for the learner
to the right. For example the question as to whether network $A'$ could learn all of
what network $A$ could perform would be ${}_{A}\mathrm{Perf}_{A'}$. We ask if there is a reasonable
$\phi$ function from architectures to architectures such that for all $A$, ${}_{A}\mathrm{Perf}_{\phi(A)}$ is
tractable.

This question can be answered positively. When given a network, $A$, and any
task of $t$ items, one can construct a network approximately $t$ times as big as $A$ that
can easily load that task. Although this is too loose a $\phi$ function to be termed a
‘result’, it does give us an upper bound on the size of learner network required.
Because the factor is $t$ rather than some power of $t$ or some exponential in $t$, we
believe that tighter answers to the $\phi$ question might indeed be interesting.

The  purpose  of  the  teacher/learner  formalism  is  to  unbundle  the  architecture 
class  from  the  task  class  and  to  deal  with  them  explicitly  and  independently. 

8.3.2  Relaxed  Criteria 

The  basic  loading  problem  asked  for  a  guarantee  that  the  algorithm  would  complete 
its  job  of  finding  a  configuration.  It  might  be  that  some  probabilistic  criterion  of 
success  would  be  easier  to  comply  with.  Perhaps  for  some  class  of  architectures  we 
will  be  able  to  find  a  randomized  procedure  that  will  run  in  polynomial  time  and 
report a solution configuration with a certain minimum probability. Repeated
invocations of the procedure would give asymptotic certainty regarding performability.
Such an algorithm could be used in applications where it was possible to judge how
much loading time each situation warranted.
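
As a rough illustration of the trade-off between loading time and certainty, the sketch below counts how many invocations are needed to push the failure probability under a chosen bound. It assumes (these are assumptions, not claims from the text) that the runs are independent and that each run finds a configuration with probability at least `p_success` whenever one exists; the function name `runs_needed` is hypothetical.

```python
def runs_needed(p_success: float, max_failure: float) -> int:
    """Smallest k such that (1 - p_success)**k <= max_failure.

    Assumes independent runs, each succeeding with probability at least
    p_success whenever the task is performable (an illustrative assumption).
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must lie in (0, 1]")
    if not 0.0 < max_failure < 1.0:
        raise ValueError("max_failure must lie in (0, 1)")
    runs = 1
    failure = 1.0 - p_success  # probability that every run so far has failed
    while failure > max_failure:
        failure *= 1.0 - p_success
        runs += 1
    return runs


if __name__ == "__main__":
    # A procedure that succeeds only one time in ten still needs few repeats.
    print(runs_needed(0.1, 0.01))
```

Under these assumptions the failure probability shrinks geometrically, so the number of repetitions grows only logarithmically in the inverse of the tolerated failure rate. The certainty is one-sided: a returned configuration can be checked directly, whereas repeated failure only makes non-performability increasingly likely.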

8.3.3  Mutating  the  Network 

Another  avenue  of  freedom  usually  not  exploited  by  connectionist  learning  schemes 
is  to  alter  the  architecture  as  learning  proceeds.  When  carried  to  extremes,  this 
would  amount  to  an  exercise  in  circuit  design,  for  which  Valiant’s  formulation  of 
the  learning  problem  is  the  most  relevant.  This  is  a  far  cry  from  current  approaches 
to  connectionist  learning,  but  adhering  rigidly  to  the  starting  architecture  may  be 
too  constrictive;  somewhere  between  these  two  extremes  we  may  find  a  scheme  that 
combines  the  best  of  both  approaches. 

8.3.4  Returning  to  Classical  Form 

As discussed in Section 2.6, the loading problem is on the easy side of three issues,
and therefore whenever a tractable loading problem is identified we do not
have  complete  evidence  that  the  problem  will  be  easy  in  the  classical  connectionist 
setting.  For  such  a  case  we  would  still  have  three  aspects  to  adapt. 

•  The  type  of  machine  used:  The  serial  algorithm  would  have  to  be  broken  up 
and  distributed  throughout  the  network. 

•  The style of processing required: The process would have to be re-implemented
in  a  ‘neural’  style. 

•  The  type  of  information  available:  The  system  would  have  to  be  altered  to 
accept  information  in  an  on-line  fashion. 

All  of  these  transformations  would  require  a  special  research  effort  since  none  of 
them  are  well  understood. 

8.3.5 Recurrent Networks

We  have  specifically  focussed  on  feed-forward  networks.  Recurrent  networks  have 
a  fundamentally  different  retrieval  process  in  that  they  start  at  some  point  in  state 
space  and  under  the  influence  of  the  input  they  travel  through  state  space,  possibly 
reaching a stable point or a limit cycle. The definition of what constitutes the
‘output’ of such a network may therefore be problematic, but this sort of machine is
very interesting and the problem of loading it should be studied.

We have suggested that our shallow feed-forward models might be relevant to
(long-term) storage of information in the brain. Hypotheses about short-term memory
in the brain are often based on cyclic electrical mechanisms which require
recurrent networks.

8.3.6  Other  Learning  Paradigms 

We  have  limited  our  inquiries  to  the  supervised  learning  paradigm.  Many  other 
types  of  protocols  (e.g.  unsupervised  learning,  or  the  use  of  queries)  are  useful 
models  of  learning  environments  but  have  not  been  formally  explored  in  the  context 
of  learning  in  networks. 

8.4  Philosophical  Summary 

A  theory  is  developed  by  progressing  from  one  hard,  clear  definition  of  a  problem 
to  another.  Clearly,  at  this  point  in  time,  it  is  still  ill-defined  what  connectionists 
require  of  a  learning  system.  There  are  many  formulations  of  it  other  than  ours 
that  might  be  appropriate  for  different  situations.  It  may  be  reasonable  just  to  ask 
for  the  ‘best’  configuration  for  a  network,  rather  than  the  ‘correct’  configuration.  It 
may  be  reasonable  just  to  ask  for  the  configuration  that  yields  performance  better 
than  a  simple  regression  procedure  would.  It  may  be  reasonable  just  to  ask  for  a 
configuration that makes maximal use of the hardware (i.e. supports the greatest
number  of  items  in  the  given  network).  The  contribution  of  such  research  may  be 
to  help  define  connectionist  learning  by  showing  which  formulations  are  achievable. 
We  have  formulated  a  basic  question.  Other  formulations  based  on  more  refined 
definitions  could  lead  to  successively  more  useful  models  of  practical  connectionist 
concerns.  Because  no  exact  definitions  of  connectionist  learning  are  yet  widely 
accepted,  we  think  that  an  analysis  of  various  definitions  leading  to  tractable  loading 
problems  would  help  establish  and  focus  the  research  in  this  area. 

The  successful  development  of  a  theory  of  an  intensely  complicated  system  like 
the  brain  depends  on  a  judicious  sequence  of  selections  of  constraints.  To  begin,  one 
must  select  one  or  two  appropriate  constraints;  then  study  them  to  understand  how 
they  interact;  choose  another  constraint;  then  add  it  to  the  others  and  elaborate 
further.  At  each  choice  point,  one  must  be  carefully  conscious  of  what  level  of 
detail  the  system  is  being  modelled  at  and  choose  constraints  that  act  at  that  same 
level.  We  think  there  has  been  too  much  emphasis  placed  on  modelling  brains  at 
the  level  of  neurons  using  constraints  like  spike  train  frequencies,  linear  threshold 
functions  or  the  sodium  pump.  This  is  like  trying  to  discover  the  principles  of  flight 
by  studying  the  microbiology  of  birds.  The  useful  level  of  study  is  much  coarser 
than  that.  Similarly  here,  just  by  taking  a  view  from  the  next  larger  scale  of  detail, 
our  investigations  have  discovered  a  universe  of  issues  that  are  almost  oblivious  to 
the  functionality  of  individual  nodes.  We  suggest  therefore  that  these  coarser  levels 
of  detail  would  be  more  productive  levels  of  modelling  for  computer  scientists  to 
pursue.  Our  hunch  is  that  after  a  theory  of  learnability  is  fleshed  out,  tuning  it  for 
a  specific  node  function  set  will  change  things  only  slightly.  In  general  the  coarser 
levels  are  the  more  important  levels. 

What  we  have  attempted  here  is  to  look  at  the  level  of  mid-size  neuroanatomical 
structures  (e.g.  cortical  slabs),  and  we  hope  that  our  choice  of  simple  constraints 
will prove propitious. We have pursued the study of feed-forward networks and
especially  the  architectural  family  of  shallow  networks  because  of  their  potential 
for  modelling  structures  in  natural  brain  cortex.  Our  model  will  be  relevant  if  we 
have  been  lucky  in  choosing  constraints  and  if  the  neural  structures  they  model 
happen  also  to  be  engaged  in  the  kind  of  information  loading  and  retrieval  that  we 
are  exploring.  We  might  have  the  wrong  model  of  the  salient  aspects  of  these  slabs 
of  cortical  columns;  we  might  have  the  wrong  model  of  how  these  slabs  actually 
retrieve  their  stored  information;  or  we  might  just  be  asking  the  wrong  analytical 
question.  (The  performability  question  used  here  requires  total,  exact,  dependable 
recognition  of  the  set  of  performable  tasks.  This  seems  unduly  demanding  and  of 
the  three  suspicions  listed  here,  the  last  one  seems  to  deserve  the  first  examination.) 

Whatever the case, our underlying assumption is that complexity analysis (and
specifically the P vs NP distinction) provides a means to narrow down the things
that biological machines do and how they do it. Our strategy is to take the general
NP-complete problem and add architectural constraints, task constraints, or other
types of constraints, and search for polynomial-time loading problems. We feel very
safe in assuming that the brain cannot be solving any NP-hard problem, and we
feel secure in assuming further that evolution would have found efficient ways to
utilize the available hardware. Ergo brain mechanisms are likely to be described
by decision problems found ‘just below’ the level of NP-completeness. Hence the
general outline and thrust of our research program.

By  providing  guidelines  for  ensuring  that  a  network  can  learn  efficiently,  we 
hope  to  contribute  to  an  urgently  needed  general  methodology  of  how  connectionist 
networks  should  be  constructed.  And  by  distinguishing  between  those  forms  of 
learning  that  are  achievable  and  those  forms  that  are  not,  we  will  be  helping  to 
identify  the  applications  to  which  neural  networks  can  be  profitably  applied.  This 
thesis  has  provided  only  the  first  steps  toward  such  a  theory. 

Appendix A


ALTERNATE  PROOF 
OF  GENERAL  THEOREM 


This  appendix  proves  a  slightly  weaker  version  of  Theorem  1  in  that  it  uses  a  node 
function  set  called  SAFns,  which  is  larger  than  AOFns.  SAFns  is  the  set  of  node 
functions  that  can  be  constructed  with  a  single  AND  gate  augmented  with  optional 
inverters  at  the  inputs  and  output.  The  construction  in  the  following  proof  is 
somewhat  different  from  those  used  in  earlier  chapters.  The  next  appendix  extends 
this  theorem  to  real-valued  node  function  sets  using  the  same  construction  used 
here. 

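First, we introduce some general purpose notation for manipulating strings. If
$\alpha$ and $\beta$ are strings then $\alpha \cdot \beta$ is the concatenation of $\alpha$ and $\beta$, and $\alpha^n$ is the
concatenation of $n$ copies of $\alpha$. We use $\bigodot_{i=1}^{n} \alpha_i$ to denote $\alpha_1 \cdot \alpha_2 \cdot \alpha_3 \cdot \ldots \cdot \alpha_n$.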

If $\alpha$ is a string, $A$ and $B$ are sets (with distinct elements), $B \subseteq A$, and the
length of $\alpha$ is $|A|$, then the notation $\alpha[{}^{A}_{B}]$ denotes the string of length $|B|$ that is
formed by associating successive elements of $\alpha$ with successive members of $A$ (which
has an implicit ordering), and then selecting from $\alpha$ only those elements that are
associated with members of $B$. For example,

if $\alpha = 2 \cdot 7 \cdot 4 \cdot 1 \cdot 9 \cdot 8$

and $A = \{d_{10}, \ldots, d_{17}, d_{19}\}$

and $B = \{d_{10}, d_{17}, d_{19}\}$

then $\alpha[{}^{A}_{B}] = 2 \cdot 9 \cdot 8$.
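
The selection notation can be sketched in a few lines of Python. This is only an illustration: the function name `select` is hypothetical, strings are modelled as lists, and the three middle labels of `A` below are made up for the example, chosen only so that `A` has six ordered members.

```python
def select(alpha, A, B):
    """Return the elements of alpha whose associated member of A lies in B.

    alpha : sequence of symbols, one per member of A
    A     : ordered sequence of distinct labels (the implicit ordering)
    B     : collection of labels, a subset of A
    """
    if len(alpha) != len(A):
        raise ValueError("alpha must have exactly one element per member of A")
    wanted = set(B)
    if not wanted <= set(A):
        raise ValueError("B must be a subset of A")
    # Pair successive elements of alpha with successive members of A,
    # then keep only the elements paired with members of B.
    return [symbol for symbol, label in zip(alpha, A) if label in wanted]


alpha = [2, 7, 4, 1, 9, 8]
# The labels d12, d13, d15 are hypothetical placeholders.
A = ["d10", "d12", "d13", "d15", "d17", "d19"]
B = {"d10", "d17", "d19"}
print(select(alpha, A, B))
```

Selecting a single element, as in the device introduced next, is the special case in which `B` has one member.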

Another notational device is used to select single elements from a string; $\alpha\langle k\rangle$
represents the $k$th element of $\alpha$. Formally, $\alpha\langle k\rangle = \alpha[{}^{\{1,2,3,\ldots,a\}}_{\{k\}}]$ where $a$ is the length
of $\alpha$.

For precision, we define the semantics of computation in a network as the unique
string that satisfies the inductive expression

$$\mathrm{Comp}^{A}_{F}(\sigma) = \sigma \cdot \bigodot_{i=1}^{n} f_i\left(\mathrm{Comp}^{A}_{F}(\sigma)[{}^{P}_{p(v_i)}]\right)$$

Such a string is unique because $A$ is acyclic and the output of each node is dependent
only on the output of previous nodes. The network mapping can now be stated as

$$M^{A}_{F}(\sigma) = \mathrm{Comp}^{A}_{F}(\sigma)[{}^{P}_{R}]$$

Theorem 21  $\mathrm{Perf}_{\mathrm{SAFns}}$ is NP-complete.

Proof: We reduce the classic satisfiability problem (SAT) to $\mathrm{Perf}_{\mathrm{SAFns}}$. (See [GJ79]
for an explanation of this process.) Let $(U, \Gamma)$ be an arbitrary instance of SAT,
where $U$ is a set of variables and $\Gamma$ is a set of clauses; $U = \{u_1, u_2, \ldots, u_w\}$,
$\Gamma = \{(\gamma_i, G_i) : 1 \le i \le m\}$. We use a novel representation of $\Gamma$, the set of clauses: for
each $i \le m$, $\gamma_i \in \{0,1\}^w$, and $G_i \subseteq U$. A string $\Pi$ is said to satisfy the instance
$(U, \Gamma)$ iff $\Pi[{}^{U}_{G_i}] \ne \gamma_i[{}^{U}_{G_i}]$ for $1 \le i \le m$. (This representation of a clause can be
obtained from the traditional disjunctive form by applying de Morgan’s Law once
and padding for variables that are not in the clause.)
We must construct an architecture $A$ and a task $T$ such that $T$ is performable
by $A$ iff $(U, \Gamma)$ is satisfiable. The set of nodes $V$ will be composed of a set $V_1$ of
“first-layer nodes” and a set $V_2$ of “second-layer nodes”.

$$S = \{v_{0,i} : 0 \le i \le w\}$$

$$V_1 = \{v_{1,j} : 1 \le j \le w\}$$

$$V_2 = \{v_{2,i} : 1 \le i \le m\}$$

$$P = S \cup V_1 \cup V_2$$

$$R = V = V_1 \cup V_2$$

$$E = \{(v_{0,0}, v_{1,j}), (v_{0,j}, v_{1,j}) : 1 \le j \le w\} \cup \{(v_{1,j}, v_{2,i}) : u_j \in G_i\}$$

$$A = (P, V, S, R, E)$$

The task is composed of 3 kinds of items. The first kind is called the “truth-value
items” and associates a binary value with ‘true’ and ‘false’:

$$T_1 = \{(0 \cdot 0^w,\ 0^w \cdot *^m),\ (0 \cdot 1^w,\ 1^w \cdot *^m)\}$$

The second kind of item is called the “disjunct semantics items”:

$$T_2 = \{(0 \cdot \gamma_i,\ *^w \cdot *^{i-1} \cdot 0 \cdot *^{m-i}) : (\gamma_i, G_i) \in \Gamma\}$$

The third kind of item is called the “conjunct semantics item”:

$$T_3 = \{(1 \cdot 0^w,\ *^w \cdot 1^m)\}$$

$$T = T_1 \cup T_2 \cup T_3$$

Figure  A.l  gives  a  construction  for  an  example  instance  of  SAT. 

Claim: A solution configuration, $F$, for $(A, T)$ exists iff a satisfying assignment,
$\Pi$, exists for $(U, \Gamma)$.

proof ($\exists F \Leftarrow \exists \Pi$): Assume $(U, \Gamma) \in$ SAT by virtue of the satisfying assignment
Take as an example the following SAT problem, expressed in traditional CNF
form: $(\bar{u}_1 \vee u_2 \vee u_3)(u_2 \vee \bar{u}_3 \vee \bar{u}_4)$. In the required form, this is equivalent to

$\gamma_1 = 1 \cdot 0 \cdot 0 \cdot 0$    $G_1 = \{u_1, u_2, u_3\}$

$\gamma_2 = 0 \cdot 0 \cdot 1 \cdot 1$    $G_2 = \{u_2, u_3, u_4\}$

The task for this problem is

$T_1 = \{(0 \cdot 0 \cdot 0 \cdot 0 \cdot 0,\ 0 \cdot 0 \cdot 0 \cdot 0 \cdot * \cdot *),\ (0 \cdot 1 \cdot 1 \cdot 1 \cdot 1,\ 1 \cdot 1 \cdot 1 \cdot 1 \cdot * \cdot *)\}$

$T_2 = \{(0 \cdot 1 \cdot 0 \cdot 0 \cdot 0,\ * \cdot * \cdot * \cdot * \cdot 0 \cdot *),\ (0 \cdot 0 \cdot 0 \cdot 1 \cdot 1,\ * \cdot * \cdot * \cdot * \cdot * \cdot 0)\}$

$T_3 = \{(1 \cdot 0 \cdot 0 \cdot 0 \cdot 0,\ * \cdot * \cdot * \cdot * \cdot 1 \cdot 1)\}$

The architecture is as follows:


Figure A.1: Example construction for proof of theorem using SAFns

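string $\Pi$. Associate the node function $f_{k,i}$ with each node $v_{k,i} \in V$, and then let
$F = \{f_{1,1}, f_{1,2}, \ldots, f_{1,w}, f_{2,1}, \ldots, f_{2,m}\}$ where

$$f_{1,j}(a \cdot b) = \begin{cases} b & \text{if } a = 0 \\ \Pi\langle j\rangle & \text{if } a = 1 \end{cases}$$

$$f_{2,i}(\alpha) = \begin{cases} 0 & \text{if } \alpha = \gamma_i[{}^{U}_{G_i}] \\ 1 & \text{otherwise} \end{cases}$$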

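We must show that $M^{A}_{F} \supseteq T$, which we do by showing $M^{A}_{F} \supseteq T_1$, $M^{A}_{F} \supseteq T_2$, and
$M^{A}_{F} \supseteq T_3$ individually. First, note that since $f_{1,j}(0 \cdot b) = b$ for all $j \le w$, we have
for any $\alpha$,

$$\mathrm{Comp}^{A}_{F}(0 \cdot \alpha)[{}^{P}_{V_1}] = \bigodot_{j=1}^{w} f_{1,j}\left((0 \cdot \alpha)[{}^{S}_{p(v_{1,j})}]\right) = \bigodot_{j=1}^{w} f_{1,j}(0 \cdot \alpha\langle j\rangle) = \bigodot_{j=1}^{w} \alpha\langle j\rangle = \alpha \qquad \text{(A.1)}$$

Equation A.1 proves $\mathrm{Comp}^{A}_{F}(\sigma)[{}^{P}_{V_1}] = \rho[{}^{R}_{V_1}]$ for both items $(\sigma, \rho) \in T_1$. Since
responses for $V_2$ are undefined, $M^{A}_{F} \supseteq T_1$.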

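For each $v_{2,i} \in V_2$ there is only one item in $T_2$ which is defined, and to agree
with that response, we must show that $M^{A}_{F}(0 \cdot \gamma_i)[{}^{R}_{\{v_{2,i}\}}] = 0$.

$$
\begin{aligned}
M^{A}_{F}(0 \cdot \gamma_i)[{}^{R}_{\{v_{2,i}\}}] &= f_{2,i}\left(\mathrm{Comp}^{A}_{F}(0 \cdot \gamma_i)[{}^{P}_{p(v_{2,i})}]\right) \\
&= f_{2,i}\left(\mathrm{Comp}^{A}_{F}(0 \cdot \gamma_i)[{}^{P}_{V_1}][{}^{V_1}_{p(v_{2,i})}]\right) && \text{since } p(v_{2,i}) \subseteq V_1 \\
&= f_{2,i}\left(\gamma_i[{}^{V_1}_{p(v_{2,i})}]\right) && \text{by (A.1) above} \\
&= f_{2,i}\left(\gamma_i[{}^{U}_{G_i}]\right) && \text{by definition of } E \\
&= 0 && \text{by definition of } f_{2,i}\text{, as required.}
\end{aligned}
$$

Since this argument holds for every node in $V_2$, and responses for $V_1$ are not defined,
$M^{A}_{F} \supseteq T_2$.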

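The only stimulus in $T_3$ is $1 \cdot 0^w$.

$$\mathrm{Comp}^{A}_{F}(1 \cdot 0^w)[{}^{P}_{V_1}] = \bigodot_{j=1}^{w} f_{1,j}\left((1 \cdot 0^w)[{}^{S}_{p(v_{1,j})}]\right) = \bigodot_{j=1}^{w} f_{1,j}(1 \cdot 0) = \bigodot_{j=1}^{w} \Pi\langle j\rangle = \Pi$$

$$
\begin{aligned}
\mathrm{Comp}^{A}_{F}(1 \cdot 0^w)[{}^{P}_{V_2}] &= \bigodot_{i=1}^{m} f_{2,i}\left(\mathrm{Comp}^{A}_{F}(1 \cdot 0^w)[{}^{P}_{V_1}][{}^{V_1}_{p(v_{2,i})}]\right) && \text{since } \bigcup_{v \in V_2} p(v) \subseteq V_1 \\
&= \bigodot_{i=1}^{m} f_{2,i}\left(\Pi[{}^{V_1}_{p(v_{2,i})}]\right) && \text{by the previous equation} \\
&= \bigodot_{i=1}^{m} f_{2,i}\left(\Pi[{}^{U}_{G_i}]\right) && \text{by definition of } E \\
&= \bigodot_{i=1}^{m} 1 && \text{by definition of } f_{2,i} \text{ and } \Pi[{}^{U}_{G_i}] \ne \gamma_i[{}^{U}_{G_i}] \\
&= 1^m
\end{aligned}
$$

So $M^{A}_{F}(1 \cdot 0^w) = \Pi \cdot 1^m \models *^w \cdot 1^m$, which is to say $M^{A}_{F} \supseteq T_3$. This completes the first
half of the claim.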

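proof ($\exists F \Rightarrow \exists \Pi$): Assume $F = \{f_{1,1}, f_{1,2}, \ldots, f_{1,w}, f_{2,1}, f_{2,2}, \ldots, f_{2,m}\}$ is a
configuration such that $M^{A}_{F} \supseteq T$. What do we know about $F$? By inspecting $T_1$, we
know

$$\mathrm{Comp}^{A}_{F}(0 \cdot 0^w)[{}^{P}_{V_1}] = \bigodot_{j=1}^{w} f_{1,j}\left((0 \cdot 0^w)[{}^{S}_{p(v_{1,j})}]\right) = \bigodot_{j=1}^{w} f_{1,j}(0 \cdot 0) = 0^w$$

by the first item. Hence $f_{1,j}(0 \cdot 0) = 0$. By the second item, we can similarly show
$f_{1,j}(0 \cdot 1) = 1$, which leads us to conclude what was shown in equation (A.1).

By inspecting $T_2$ and $T_3$, we have for every $i$, $1 \le i \le m$

$$f_{2,i}\left(\mathrm{Comp}^{A}_{F}(0 \cdot \gamma_i)[{}^{P}_{p(v_{2,i})}]\right) = 0 \ne 1 = f_{2,i}\left(\mathrm{Comp}^{A}_{F}(1 \cdot 0^w)[{}^{P}_{p(v_{2,i})}]\right)$$

$$\mathrm{Comp}^{A}_{F}(0 \cdot \gamma_i)[{}^{P}_{p(v_{2,i})}] \ne \mathrm{Comp}^{A}_{F}(1 \cdot 0^w)[{}^{P}_{p(v_{2,i})}]$$

Applying equation (A.1) and the definition of $E$ on the l.h.s.,

$$\text{l.h.s.} = \mathrm{Comp}^{A}_{F}(0 \cdot \gamma_i)[{}^{P}_{V_1}][{}^{V_1}_{p(v_{2,i})}] = \gamma_i[{}^{V_1}_{p(v_{2,i})}] = \gamma_i[{}^{U}_{G_i}]$$

Simplifying the r.h.s. by letting $\Pi = \mathrm{Comp}^{A}_{F}(1 \cdot 0^w)[{}^{P}_{V_1}]$,

$$\text{r.h.s.} = \mathrm{Comp}^{A}_{F}(1 \cdot 0^w)[{}^{P}_{V_1}][{}^{V_1}_{p(v_{2,i})}] = \Pi[{}^{V_1}_{p(v_{2,i})}] = \Pi[{}^{U}_{G_i}]$$

Reassembling, we have $\gamma_i[{}^{U}_{G_i}] \ne \Pi[{}^{U}_{G_i}]$ for all $i$, $1 \le i \le m$, which is to say that $\Pi$
satisfies $(U, \Gamma)$ and the claim is proved.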
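
The reduction can be exercised on the example instance of Figure A.1. The sketch below is illustrative only: all names are hypothetical, strings are tuples, `"*"` stands for a don't-care, and clauses are given as pairs of a bit tuple and a list of 1-based variable indices. It builds the three kinds of items and evaluates the configuration derived from an assignment, confirming that this configuration performs the task exactly when the assignment satisfies the instance. It checks only configurations of that derived form, not every possible configuration.

```python
from itertools import product

STAR = "*"  # don't-care symbol in responses


def build_task(w, clauses):
    """Items T1, T2, T3 for clauses given as (gamma, G) pairs."""
    m = len(clauses)
    t1 = [((0,) + (0,) * w, (0,) * w + (STAR,) * m),
          ((0,) + (1,) * w, (1,) * w + (STAR,) * m)]
    t2 = [((0,) + tuple(gamma),
           (STAR,) * w + (STAR,) * i + (0,) + (STAR,) * (m - i - 1))
          for i, (gamma, _) in enumerate(clauses)]
    t3 = [((1,) + (0,) * w, (STAR,) * w + (1,) * m)]
    return t1 + t2 + t3


def network_output(stimulus, assignment, clauses):
    """Outputs of V1 then V2 under the configuration built from `assignment`."""
    gate, bits = stimulus[0], stimulus[1:]
    # First layer: pass the input bit through if gate == 0, else emit the assignment bit.
    layer1 = tuple(b if gate == 0 else pi for b, pi in zip(bits, assignment))
    # Second layer: output 0 only on the forbidden pattern gamma restricted to G.
    layer2 = tuple(
        0 if all(layer1[j - 1] == gamma[j - 1] for j in G) else 1
        for gamma, G in clauses)
    return layer1 + layer2


def performs(task, assignment, clauses):
    """True iff every defined response bit of every item is reproduced."""
    return all(
        want == STAR or want == got
        for stimulus, response in task
        for want, got in zip(response, network_output(stimulus, assignment, clauses)))


def satisfies(assignment, clauses):
    """True iff the assignment differs from gamma somewhere on each G."""
    return all(any(assignment[j - 1] != gamma[j - 1] for j in G)
               for gamma, G in clauses)


clauses = [((1, 0, 0, 0), [1, 2, 3]), ((0, 0, 1, 1), [2, 3, 4])]
task = build_task(4, clauses)
agreement = all(performs(task, pi, clauses) == satisfies(pi, clauses)
                for pi in product((0, 1), repeat=4))
print(len(task), agreement)
```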

Thus we have SAT $\propto \mathrm{Perf}_{\mathrm{SAFns}}$ and it is easy to see that the algorithm for the
transformation runs in polynomial time (in fact linear time and log space).

Finally, it must be demonstrated that there is a non-deterministic machine that
can decide $\mathrm{Perf}_{\mathrm{SAFns}}$ in time polynomial in the length of $(A, T)$. That is, there
must be a poly-time method of writing down a valid SAFns configuration and
checking that it is correct. Writing down a function from SAFns requires one bit
for every nodal input (to specify whether it should be inverted before entering the
AND gate), and one bit for the output (to specify whether the whole function should
be inverted). For the complete configuration, this takes one bit for each edge in $A$
and one bit for each node in $A$. That the configuration is correct can be checked by
evaluating each node function once for each item in $T$. This takes time $O(|V| \times |T|)$
under the assumption that it takes constant time to evaluate any single $f_i$.
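
A minimal sketch of this encoding and check follows. The representation is an assumption made for illustration: each node is a tuple of name, predecessor names, one inverter bit per input, and one inverter bit for the output, listed in an order where predecessors come first. The small XOR network used as the example is hypothetical and is not taken from the proof.

```python
def safn(input_inverts, output_invert, inputs):
    """Single AND gate with optional inverters on each input and on the output."""
    conj = all((x ^ inv) == 1 for x, inv in zip(inputs, input_inverts))
    return int(conj) ^ output_invert


def evaluate(stimulus, input_names, nodes):
    """Evaluate every node once, in the given (topological) order."""
    value = dict(zip(input_names, stimulus))
    for name, preds, inverts, out_invert in nodes:
        value[name] = safn(inverts, out_invert, [value[p] for p in preds])
    return value


def performs(task, input_names, nodes):
    """Check each item; responses list only the defined output bits."""
    for stimulus, response in task:
        value = evaluate(stimulus, input_names, nodes)
        if any(value[node] != bit for node, bit in response.items()):
            return False
    return True


def description_bits(nodes):
    """One bit per edge plus one bit per node."""
    return sum(len(preds) + 1 for _, preds, _, _ in nodes)


INPUTS = ("a", "b")
NODES = [
    ("o", ("a", "b"), (1, 1), 1),  # OR  = NOT(AND(NOT a, NOT b))
    ("n", ("a", "b"), (0, 0), 1),  # NAND
    ("x", ("o", "n"), (0, 0), 0),  # AND of the two, giving XOR
]
XOR_TASK = [((a, b), {"x": a ^ b}) for a in (0, 1) for b in (0, 1)]
print(performs(XOR_TASK, INPUTS, NODES), description_bits(NODES))
```

Each item triggers exactly one evaluation per node, which matches the $O(|V| \times |T|)$ count when every evaluation is charged constant time.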

This, and SAT $\propto \mathrm{Perf}_{\mathrm{SAFns}}$, implies that $\mathrm{Perf}_{\mathrm{SAFns}}$ is NP-complete. □

Appendix B


PROOF FOR LOGISTIC LINEAR
NODE FUNCTIONS


The  next  theorem  extends  the  previous  one  to  the  case  of  certain  real-valued  node 
functions.  We  consider  a  function  set  used  in  [RHW86]  wherein  every  member  of 
the  set  is  a  function  composed  of  two  parts.  The  first  part  is  the  logistic  function 
and  the  second  is  a  linear  weighted  sum  of  its  inputs. 

$$f(\alpha) = E(e(\alpha))$$

where

$$e(\alpha) = w_0 + \sum_{i} w_i \times \alpha\langle i\rangle,$$

and

$$E(x) = \frac{1}{1 + e^{-x}}.$$

We call these functions LLFns (for Logistic Linear Functions). The $E$ function is
fixed for all nodes, so to specify a member of LLFns it is enough to specify the
weights $w_0, w_1, \ldots$ used in $e$.

Following [RHW86] again, we say that a value agrees with 1 if it is no smaller
than 0.9, and it agrees with 0 if it is no larger than 0.1. Note that $E(x)$ asymptotically
approaches 1 as $x$ approaches $+\infty$, and that $E(x)$ asymptotically approaches
0 as $x$ approaches $-\infty$. Let $d$ be some scalar value. We say that $\alpha$ agrees for high
$d$ with $\beta$ (written $\alpha \stackrel{d}{\models} \beta$) if there is some value for $d$ beyond which $\alpha$ always agrees
with $\beta$. This implies that the value of $\alpha$ or $\beta$ is a function of $d$.

$$\alpha(d) \stackrel{d}{\models} \beta(d) \iff \exists d_0 \text{ such that } \alpha(d) \models \beta(d) \text{ for all } d > d_0$$

Such agreement is easy to prove if $\alpha$ is monotonic in $d$ and $\beta$ is constant.
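
To make the thresholds concrete: $E(d) \ge 0.9$ exactly when $e^{-d} \le 1/9$, that is when $d \ge \ln 9 \approx 2.197$, and by symmetry $E(-d) \le 0.1$ for the same values of $d$. The sketch below checks this numerically on a grid of sample points. The helper names are hypothetical, and sampling a grid illustrates the definition without proving agreement for all larger $d$.

```python
import math


def logistic(x):
    """The fixed squashing function E."""
    return 1.0 / (1.0 + math.exp(-x))


def agrees(value, target):
    """Agreement with a binary target: >= 0.9 for 1, <= 0.1 for 0."""
    return value >= 0.9 if target == 1 else value <= 0.1


def agrees_from(alpha, target, d_values):
    """Smallest sampled d from which alpha(d) agrees with target on every
    later sample, or None if the last sample does not agree."""
    threshold = None
    for d in d_values:
        if agrees(alpha(d), target):
            if threshold is None:
                threshold = d
        else:
            threshold = None
    return threshold


samples = [0.5 * k for k in range(1, 21)]  # d = 0.5, 1.0, ..., 10.0
print(agrees_from(lambda d: logistic(d), 1, samples))
print(agrees_from(lambda d: logistic(-d), 0, samples))
```

Because the logistic function is monotonic, once a sample agrees every larger one does too, which is the situation the sentence above describes.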

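Note that if two such agreement statements hold for the same high parameter
then they hold simultaneously for that parameter.

$$(\alpha \stackrel{d}{\models} \beta \text{ and } \delta \stackrel{d}{\models} \xi) \iff \alpha \cdot \delta \stackrel{d}{\models} \beta \cdot \xi$$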

A  new  notational  device  is  used  to  select  single  elements  from  a  string  in  the  case 
where  the  element’s  position  in  a  string  is  not  known  except  through  its  relative 
position  in  one  of  the  clause  sets,  Gt,  in  I\  For  that  situation,  we  use  j  to 

mean  the  index  in  U  of  the  kth  element  of  clause  i.  Formally,  \  -  (OJ=i 
Consequently,  this  identity  holds:  a<£)  =  o^JW- 

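Theorem 22  $\mathrm{Perf}_{\mathrm{LLFns}}$ is NP-complete.

Proof: We construct a performability problem $(A, T)$ where the architecture, $A$, is
the same as it was in the proof of Theorem 21 except that $R = V_2$ instead of $R = V$,
and the task, $T$, is as follows:

$$T = T_1 \cup T_2 \cup T_3$$

$$T_1 = \{(0 \cdot \gamma_i,\ *^{i-1} \cdot 0 \cdot *^{m-i}) : 1 \le i \le m\}$$

$$T_2 = \{(0 \cdot \gamma_i^{(k)},\ *^{i-1} \cdot 1 \cdot *^{m-i}) : 1 \le k \le |G_i|,\ 1 \le i \le m\}$$

$$T_3 = \{(1 \cdot \gamma_i,\ *^{i-1} \cdot 1 \cdot *^{m-i}) : 1 \le i \le m\}$$

where $\gamma_i^{(k)}$ is $\gamma_i$ with the $k$th relevant bit inverted:

$$\gamma_i^{(k)}\langle j\rangle = \begin{cases} 1 - \gamma_i\langle j\rangle & \text{if } u_j \text{ is the } k\text{th element of } G_i \\ \gamma_i\langle j\rangle & \text{otherwise} \end{cases}$$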
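Claim: There exists a solution configuration $F$ to $(A, T)$ iff there exists a solution
assignment, $\Pi$, to $(U, \Gamma)$. For both directions of the proof we shall use the following
definitions (they each stand for the computation performed by the first layer of
nodes when the net is given some stimulus in the task):

$$\zeta_i = \mathrm{Comp}^{A}_{F}(0 \cdot \gamma_i)[{}^{P}_{V_1}] \quad \text{(from } T_1)$$

$$\beta_i^{(k)} = \mathrm{Comp}^{A}_{F}(0 \cdot \gamma_i^{(k)})[{}^{P}_{V_1}] \quad \text{(from } T_2)$$

$$\eta_i = \mathrm{Comp}^{A}_{F}(1 \cdot \gamma_i)[{}^{P}_{V_1}] \quad \text{(from } T_3)$$

proof ($\exists F \Leftarrow \exists \Pi$): Specify the node functions as follows:

$$f_{1,j}(a \cdot b) = \begin{cases} E(-d + 2da + 2db) & \text{if } \Pi\langle j\rangle = 1 \\ E(-d - 2da + 2db) & \text{if } \Pi\langle j\rangle = 0 \end{cases}$$

$$f_{2,i}(\alpha) = E(e_{2,i}(\alpha))$$

where

$$e_{2,i}(\alpha) = -d + 2d \sum_{k=1}^{|G_i|} w_{i,k} \times \left(\gamma_i[{}^{U}_{G_i}]\langle k\rangle - \alpha\langle k\rangle\right)$$

$$w_{i,k} = \begin{cases} +1 & \text{if } \gamma_i[{}^{U}_{G_i}]\langle k\rangle = 1 \\ -1 & \text{if } \gamma_i[{}^{U}_{G_i}]\langle k\rangle = 0 \end{cases}$$

The above expression for $e_{2,i}$ is not in standard form but it is straightforward to
rearrange it so that it is.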

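We shall check that each subtask is performed correctly by this configuration.
Observe

$$f_{1,j}(0 \cdot 0) = E(-d + 0 + 0) \stackrel{d}{\models} 0$$

$$f_{1,j}(0 \cdot 1) = E(-d + 0 + 2d) \stackrel{d}{\models} 1$$

Hence $\mathrm{Comp}^{A}_{F}(0 \cdot \alpha)[{}^{P}_{V_1}] \stackrel{d}{\models} \alpha$. Consequently $\zeta_i \stackrel{d}{\models} \gamma_i$. Also

$$f_{2,i}\left(\zeta_i[{}^{V_1}_{p(v_{2,i})}]\right) = E\left(-d + 2d \sum_{k} w_{i,k}\left(\gamma_i[{}^{U}_{G_i}]\langle k\rangle - \zeta_i[{}^{V_1}_{p(v_{2,i})}]\langle k\rangle\right)\right) \stackrel{d}{\models} 0$$

The agreement holds because the total value of the summation tends to 0 as $d$
increases. This argument applies to each value of $i$, and hence for high $d$, $M^{A}_{F} \supseteq T_1$.

Consider a typical item in $T_2$. Note that $\beta_i^{(k)} \stackrel{d}{\models} \gamma_i^{(k)}$ and that the $k$th relevant
element of $\beta_i^{(k)}$ therefore differs from the $k$th relevant element of $\gamma_i$ as $d$ increases.
The absolute difference converges monotonically to 1, so we have

$$f_{2,i}\left(\beta_i^{(k)}[{}^{V_1}_{p(v_{2,i})}]\right) = E\left(-d + 2d \sum_{c} w_{i,c}\left(\gamma_i[{}^{U}_{G_i}]\langle c\rangle - \beta_i^{(k)}[{}^{V_1}_{p(v_{2,i})}]\langle c\rangle\right)\right) \stackrel{d}{\models} 1$$

Here we know the agreement for high $d$ holds because the total value of the summation
tends to 1 as $d$ increases. Since the equation is valid for all $1 \le k \le |G_i|$ for
each $\gamma_i$, $M^{A}_{F} \supseteq T_2$ for high $d$.

Next we consider a typical item in $T_3$. For all nodes in layer 1, observe

if $\Pi\langle j\rangle = 1$, $f_{1,j}(1 \cdot x) = E(-d + 2d + 2dx) \stackrel{d}{\models} 1$ for $x \in \{0, 1\}$

if $\Pi\langle j\rangle = 0$, $f_{1,j}(1 \cdot x) = E(-d - 2d + 2dx) \stackrel{d}{\models} 0$ for $x \in \{0, 1\}$

Hence $f_{1,j}(1 \cdot x) \stackrel{d}{\models} \Pi\langle j\rangle$ and consequently $\eta_i \stackrel{d}{\models} \Pi$ for all $i$. Examining the second
layer, we know

$$f_{2,i}\left(\eta_i[{}^{V_1}_{p(v_{2,i})}]\right) = E\left(-d + 2d \sum_{k} w_{i,k}\left(\gamma_i[{}^{U}_{G_i}]\langle k\rangle - \eta_i[{}^{V_1}_{p(v_{2,i})}]\langle k\rangle\right)\right) \stackrel{d}{\models} 1$$

because as $d$ increases the summation converges to some integer representing the
number of places where $\eta_i[{}^{V_1}_{p(v_{2,i})}]$ is not equal to $\gamma_i[{}^{U}_{G_i}]$; that is, the number of places
where $\gamma_i[{}^{U}_{G_i}]$ is not equal to $\Pi[{}^{U}_{G_i}]$. By the initial assumption about $\Pi$, this integer
is at least 1, so the agreement holds (for high $d$). This demonstrates that $M^{A}_{F} \supseteq T_3$
for high $d$.

By selecting some value for $d$ which satisfies all the above agreements, $M^{A}_{F} \supseteq T$
and this completes the proof of one direction of the claim.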
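
The role of the parameter $d$ can be checked numerically on the two-clause example from Appendix A. The sketch below is an illustration under stated assumptions: the names are hypothetical, clauses are pairs of a bit tuple and a list of 1-based variable indices, the first-layer and second-layer functions follow the forms $E(-d \pm 2da + 2db)$ and $E(-d + 2d\sum_k w_{i,k}(\cdot))$ described above, and the $T_3$ stimulus is taken to be $1 \cdot \gamma_i$.

```python
import math


def E(x):
    return 1.0 / (1.0 + math.exp(-x))


def agrees(value, target):
    return value >= 0.9 if target == 1 else value <= 0.1


def first_layer(stimulus, pi, d):
    """f_{1,j}(a.b) = E(-d + 2da + 2db) if pi_j = 1, else E(-d - 2da + 2db)."""
    a, bits = stimulus[0], stimulus[1:]
    return [E(-d + (2 * d * a if p == 1 else -2 * d * a) + 2 * d * b)
            for b, p in zip(bits, pi)]


def second_layer(layer1, clauses, d):
    """f_{2,i} = E(-d + 2d * sum_k w_k * (gamma_k - alpha_k)) over the clause's variables."""
    outputs = []
    for gamma, G in clauses:
        total = sum((1 if gamma[j - 1] == 1 else -1) * (gamma[j - 1] - layer1[j - 1])
                    for j in G)
        outputs.append(E(-d + 2 * d * total))
    return outputs


def flip(gamma, j):
    """gamma with the bit of variable j inverted."""
    bits = list(gamma)
    bits[j - 1] = 1 - bits[j - 1]
    return tuple(bits)


def performs(pi, clauses, d):
    """Check every item of T1, T2 and T3 against the agreement thresholds."""
    for i, (gamma, G) in enumerate(clauses):
        items = [((0,) + tuple(gamma), 0)]                # T1
        items += [((0,) + flip(gamma, j), 1) for j in G]  # T2
        items += [((1,) + tuple(gamma), 1)]               # T3
        for stimulus, target in items:
            output = second_layer(first_layer(stimulus, pi, d), clauses, d)[i]
            if not agrees(output, target):
                return False
    return True


clauses = [((1, 0, 0, 0), [1, 2, 3]), ((0, 0, 1, 1), [2, 3, 4])]
print(performs((0, 0, 0, 0), clauses, 5.0))  # satisfying assignment, large d
print(performs((0, 0, 0, 0), clauses, 1.0))  # same assignment, d too small
print(performs((1, 0, 1, 1), clauses, 5.0))  # assignment violating the second clause
```

For this instance $d = 5$ is already large enough, while $d = 1$ is not, which illustrates why the agreements are only claimed for high $d$.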
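proof ($\exists F \Rightarrow \exists \Pi$): Let $y_{j,k}$ and $z_{i,k}$ be the weights employed in the node functions
as follows: for all $i, j$, $1 \le i \le m$, $1 \le j \le w$, let

$$f_{1,j}(a \cdot b) = E(y_{j,0} + y_{j,1}\,a + y_{j,2}\,b)$$

$$f_{2,i}(\alpha) = E\left(z_{i,0} + \sum_{k=1}^{|G_i|} z_{i,k}\,\alpha\langle k\rangle\right)$$

Define the satisfying assignment:

$$\Pi\langle j\rangle = \begin{cases} 1 & \text{if } 0 < y_{j,1}\,y_{j,2} \\ 0 & \text{otherwise} \end{cases}$$

We must show $\Pi$ satisfies $(U, \Gamma)$.

By assumption, the configuration $F$ performs $T_1$ and $T_2$, so we know for each $i$,
$1 \le i \le m$ and for any $k$, $1 \le k \le |G_i|$

$$f_{2,i}\left(\zeta_i[{}^{V_1}_{p(v_{2,i})}]\right) \models 0$$

$$f_{2,i}\left(\beta_i^{(k)}[{}^{V_1}_{p(v_{2,i})}]\right) \models 1$$

so

$$f_{2,i}\left(\zeta_i[{}^{V_1}_{p(v_{2,i})}]\right) < f_{2,i}\left(\beta_i^{(k)}[{}^{V_1}_{p(v_{2,i})}]\right)$$

$$E\left(z_{i,0} + \sum_{c} z_{i,c}\,\zeta_i[{}^{V_1}_{p(v_{2,i})}]\langle c\rangle\right) < E\left(z_{i,0} + \sum_{c} z_{i,c}\,\beta_i^{(k)}[{}^{V_1}_{p(v_{2,i})}]\langle c\rangle\right)$$

but $\zeta_i\langle j\rangle = \beta_i^{(k)}\langle j\rangle$ for every $j$ other than the index in $U$ of the $k$th element of $G_i$,
or more specifically the $c$th terms of the two sums differ only when $c = k$.
Therefore, letting $j$ be the index in $U$ of the $k$th element of $G_i$,

$$z_{i,k}\,\zeta_i\langle j\rangle < z_{i,k}\,\beta_i^{(k)}\langle j\rangle$$

Expand both sides in terms of $f_{1,j}$.

$$z_{i,k}\,E(y_{j,0} + y_{j,2}\,\gamma_i\langle j\rangle) < z_{i,k}\,E(y_{j,0} + y_{j,2}\,(1 - \gamma_i\langle j\rangle))$$

$$z_{i,k}\,y_{j,2}\,\gamma_i\langle j\rangle < z_{i,k}\,y_{j,2}\,(1 - \gamma_i\langle j\rangle)$$

$$\text{if } \gamma_i\langle j\rangle = 0 \text{ then } 0 < z_{i,k}\,y_{j,2} \qquad \text{(B.1)}$$

$$\text{if } \gamma_i\langle j\rangle = 1 \text{ then } z_{i,k}\,y_{j,2} < 0 \qquad \text{(B.2)}$$

Again, by assumption that $F$ configures the net for $T_1$ and $T_3$,

$$f_{2,i}\left(\zeta_i[{}^{V_1}_{p(v_{2,i})}]\right) \models 0$$

$$f_{2,i}\left(\eta_i[{}^{V_1}_{p(v_{2,i})}]\right) \models 1$$

$$E\left(z_{i,0} + \sum_{k} z_{i,k}\,\zeta_i[{}^{V_1}_{p(v_{2,i})}]\langle k\rangle\right) < E\left(z_{i,0} + \sum_{k} z_{i,k}\,\eta_i[{}^{V_1}_{p(v_{2,i})}]\langle k\rangle\right)$$

$$\sum_{k} z_{i,k}\,\zeta_i[{}^{V_1}_{p(v_{2,i})}]\langle k\rangle < \sum_{k} z_{i,k}\,\eta_i[{}^{V_1}_{p(v_{2,i})}]\langle k\rangle$$

For this to be true for a given $i$, there must be at least one $k$ such that

$$z_{i,k}\,\zeta_i[{}^{V_1}_{p(v_{2,i})}]\langle k\rangle < z_{i,k}\,\eta_i[{}^{V_1}_{p(v_{2,i})}]\langle k\rangle$$

Letting $j$ be the index in $U$ of the $k$th element of $G_i$ and expanding both sides as $f_{1,j}$,

$$z_{i,k}\,E(y_{j,0} + y_{j,2}\,\gamma_i\langle j\rangle) < z_{i,k}\,E(y_{j,0} + y_{j,1} + y_{j,2}\,\gamma_i\langle j\rangle)$$

$$0 < z_{i,k}\,y_{j,1}$$

From this and (B.1), we find that

$$\gamma_i\langle j\rangle = 0 \Rightarrow 0 < z_{i,k}\,z_{i,k}\,y_{j,1}\,y_{j,2} \Rightarrow 0 < y_{j,1}\,y_{j,2} \Rightarrow \Pi\langle j\rangle = 1$$

Similarly, $\gamma_i\langle j\rangle = 1 \Rightarrow \Pi\langle j\rangle = 0$. Summarizing, for all $i$, there exists a $k$ such that
the $k$th elements of $\gamma_i[{}^{U}_{G_i}]$ and $\Pi[{}^{U}_{G_i}]$ differ, or rather $\gamma_i[{}^{U}_{G_i}] \ne \Pi[{}^{U}_{G_i}]$. That is, $\Pi$
satisfies $(U, \Gamma)$ and the claim is proved.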

The claim establishes that the reduction from SAT is valid. Since the transformation
can be performed in polynomial time, $\mathrm{Perf}_{\mathrm{LLFns}}$ is NP-hard.

$\mathrm{Perf}_{\mathrm{LLFns}}$ is in NP if there is a polynomial-time procedure to write down values
for all the weights. For the case where the weights are truly real-valued (meaning
that a weight would have a potentially infinite number of digits), it has not yet
been proven that there is a finite approximation that is effectively equivalent to
the real numbers (as Hong has done for LSFns). However, for the more realistic
case of fixed resolution in each ‘real’ weight, specifying the configuration is easily
performed in polynomial time. With that minor caveat, we have proved $\mathrm{Perf}_{\mathrm{LLFns}}$
is NP-complete. □

Three aspects of LLFns are crucial to the preceding proof: $E$ is monotonic, $E$
is bounded, and $e$ is linear. Other aspects were convenient but not necessary; for
example, that every node had a fixed $E$ function, that every node had the same $E$
function, and that $E$ was onto the unit interval [0, 1). We proved the theorem for LLFns only
in order to avoid excessive abstraction, but the theorem is extendible to other node
function sets.

If we define the quasi-linear functions (QLFns) as all those functions of the
form $E(e(\alpha))$ where $e$ is linear and $E$ is bounded and monotonic, then for some
appropriate definition of agreement we have

Corollary  23  PerfQLFns  is  NP-complete. 

The  theorem  is  probably  extendible  to  different  manifestations  of  non-linearity, 
but  we  note  that  something  about  E  should  be  non-linear,  for  if  E  (as  well  as  e) 
is  linear,  then  the  net  as  a  whole  can  implement  only  linear  mappings.  From  the 
point  of  view  of  connectionists,  this  is  uninteresting. 

Appendix C


PROOF  FOR  CASE 
WITHOUT  DON’T  CARES 


The  proof  of  Theorem  1  uses  the  *  symbol  to  denote  ‘don’t  cares’  in  the  response 
strings.  This  is  often  not  a  feature  of  connectionist  experiments  so  the  following 
proof  avoids  the  *  in  order  to  demonstrate  that  it  is  not  an  important  change  to 
the  model. 

Theorem 24  $\mathrm{Perf}_{\mathrm{AOFns}}$ is NP-complete even when responses have no *’s.

Proof: by reduction from 3SAT. The proof is modelled on the one for Theorem 1.
Let the 3SAT problem be $(Z, C)$ where $Z$ is a set of variables $\{\zeta_1, \zeta_2, \zeta_3, \ldots\}$ and $C$
is a set of disjunctive clauses over them. Let $w = |Z|$ be the number of variables
and $m = |C|$ the number of clauses. For $(Z, C)$ to be satisfiable, there must be an
assignment $\Pi : Z \to \{0, 1\}$ such that at least one literal in each clause has value 1.
Formally, the 3SAT instance $(Z, C)$ is reduced to $(A, T)$, where

$$A = (P, V, S, R, E)$$

$$S = \{a, b, d, e\}$$

$$V = \{u_i, v_i, w_i, x_i, y_i, z_i : \zeta_i \in Z\} \cup \{c_j : C_j \in C\}$$

$$R = \{u_i, x_i, y_i, v_i : \zeta_i \in Z\} \cup \{c_j : C_j \in C\}$$

$$P = S \cup V$$

$$
\begin{aligned}
E = {} & \{(a, w_i), (a, z_i), (b, w_i), (b, z_i), \\
& \ (w_i, u_i), (w_i, x_i), (w_i, y_i), (z_i, x_i), (z_i, y_i), (z_i, v_i), \\
& \ (d, u_i), (d, v_i) : \zeta_i \in Z\} \\
& \cup \{(w_i, c_j) : \zeta_i \in C_j\} \cup \{(z_i, c_j) : \bar{\zeta}_i \in C_j\} \cup \{(e, c_j) : C_j \in C\}
\end{aligned}
$$

$$T = \{I_1, I_2, I_3\}$$

$$I_1 = (0\ 0\ 1\ 1,\ (0\ 0\ 0\ 0)^w\ 0^m)$$

$$I_2 = (1\ 1\ 1\ 0,\ (1\ 1\ 1\ 1)^w\ 0^m)$$

$$I_3 = (0\ 1\ 0\ 1,\ (0\ 0\ 1\ 0)^w\ 1^m)$$

This construction is explained in a 2-stage example. Stage 1: For every variable
$\zeta_i \in Z$ construct the partial architecture and partial task shown in Figure C.1. This
is very similar to Figure 4.1 on page 42. The differences are that $w$ and $z$ are no
longer network outputs; instead they go to new nodes $u$ and $v$ which are network
outputs. Also there is a new input, $d$, which goes only to these new nodes. In the
task, note that all response bits for $u$ and $v$ are the same as they were for $w$ and
$z$ except that the *’s have been replaced by 0’s (arbitrarily). In the items where $w$
and $z$ had been defined, $d$ is a 1; in the items where $w$ and $z$ had been ‘don’t cares’,
$d$ is a 0.

From items 1 and 2 we know

$$f_u(f_w(0,0), 1) = 0 \ne 1 = f_u(f_w(1,1), 1). \tag{C.1}$$

Hence

$$f_w(0,0) \ne f_w(1,1). \tag{C.2}$$

Similarly

$$f_z(0,0) \ne f_z(1,1). \tag{C.3}$$

By comparing item 2 and item 3 we know

$$f_x(f_w(1,1), f_z(1,1)) = 1 \ne 0 = f_x(f_w(0,1), f_z(0,1))$$

$$f_w(1,1) \ne f_w(0,1) \ \text{ or } \ f_z(1,1) \ne f_z(0,1),$$

and by using (C.2) and (C.3)

$$f_w(0,0) = f_w(0,1) \ \text{ or } \ f_z(0,0) = f_z(0,1). \tag{C.4}$$

By comparing item 1 and item 3 we know

$$f_y(f_w(0,0), f_z(0,0)) = 0 \ne 1 = f_y(f_w(0,1), f_z(0,1))$$

$$f_w(0,0) \ne f_w(0,1) \ \text{ or } \ f_z(0,0) \ne f_z(0,1). \tag{C.5}$$

We will associate some SAT variable $\zeta$ with the group of nodes in this construction.
For mnemonic value and brevity, let $\langle\zeta\rangle$ stand for the truth of the
inequality $f_w(0,0) \ne f_w(0,1)$. And let $\langle\bar\zeta\rangle$ stand for the truth of the inequality
$f_z(0,0) \ne f_z(0,1)$. Translating (C.4) and (C.5) we have

$$(\text{not}\langle\zeta\rangle \text{ or } \text{not}\langle\bar\zeta\rangle) \ \text{ and } \ (\langle\zeta\rangle \text{ or } \langle\bar\zeta\rangle)$$

which implies $\langle\zeta\rangle = \text{not}\langle\bar\zeta\rangle$.
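The Stage 1 argument can be checked mechanically. The following is a minimal sketch, not part of the original proof: it assumes the gadget wiring described above (w and z read a and b; u reads w and d; v reads z and d; x and y read w and z) and, as a simplifying assumption, lets every node use any two-input Boolean function rather than only the functions in the node function set. The names `realizable`, `gadget_performs` and `feasible_literal_pairs` are hypothetical. It enumerates the possible outputs of w and z on the three items and keeps those for which u, x, y and v can still produce the required responses.

```python
from itertools import product

# Per-item data for the Stage 1 gadget: items 1, 2, 3 use (a, b) = (0,0), (1,1), (0,1).
D_BITS = [1, 1, 0]                                        # input d on items 1, 2, 3
RESPONSES = [(0, 0, 0, 0), (1, 1, 1, 1), (0, 0, 1, 0)]    # required (u, x, y, v)

# Every two-input Boolean function, stored as a lookup table (an assumption:
# the theorem itself restricts nodes to a smaller function set).
ALL_FUNCS = [dict(zip(product((0, 1), repeat=2), bits))
             for bits in product((0, 1), repeat=4)]

def realizable(constraints):
    """True if some two-input Boolean function meets all (inputs, output) pairs."""
    return any(all(f[args] == out for args, out in constraints)
               for f in ALL_FUNCS)

def gadget_performs(w, z):
    """w[k] and z[k] are the outputs of nodes w and z on item k+1."""
    u = [((w[k], D_BITS[k]), RESPONSES[k][0]) for k in range(3)]
    x = [((w[k], z[k]), RESPONSES[k][1]) for k in range(3)]
    y = [((w[k], z[k]), RESPONSES[k][2]) for k in range(3)]
    v = [((z[k], D_BITS[k]), RESPONSES[k][3]) for k in range(3)]
    return all(realizable(c) for c in (u, x, y, v))

def feasible_literal_pairs():
    """Collect (f_w(0,0) != f_w(0,1), f_z(0,0) != f_z(0,1)) for every workable gadget."""
    pairs = []
    for w in product((0, 1), repeat=3):
        for z in product((0, 1), repeat=3):
            if gadget_performs(w, z):
                pairs.append((w[0] != w[2], z[0] != z[2]))
    return pairs

pairs = feasible_literal_pairs()
print(len(pairs), all(lit != neg for lit, neg in pairs))
```

In every workable case exactly one of the two inequalities holds, which is the relation derived from (C.4) and (C.5).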

Stage 2: For each clause in the SAT system construct a single node in the second
layer of the architecture with inputs from all nodes associated with its participating
literals and an input from post e. Putting variables’ nodes and the clause node
together, we get what is shown in Figure C.2. It shows the construction for an
example SAT system consisting of only one clause $(\zeta_1, \zeta_2, \zeta_3)$. Observe that each
item consists of the stimulus from an item from Figure C.1, a new stimulus bit
for e, three replications of the associated response (one per variable), and another
response bit for the clause node.

Claim:  The  constructed  architecture  can  perform  the  task  iff  the  SAT  instance 
is  satisfiable. 

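Proof: By inspecting item 1 and item 3,

$$f_c(f_w^1(0,0), f_w^2(0,0), f_w^3(0,0), 1) = 0$$

$$f_c(f_w^1(0,1), f_w^2(0,1), f_w^3(0,1), 1) = 1$$

Since not all of the arguments can be the same, conclude

$$\langle\zeta_1\rangle \text{ or } \langle\zeta_2\rangle \text{ or } \langle\zeta_3\rangle.$$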


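Now if $\Pi$ exists then let $\langle\zeta_i\rangle = \Pi(\zeta_i)$, that is, let $f_w^i(0,0) = 0$ and $f_w^i(0,1) = \Pi(\zeta_i)$
and $f_w^i(1,1) = 1$ for all $i$, or more definitively, let

$$
f_w^i = \begin{cases} \mathrm{OR} & \text{if } \Pi(\zeta_i) = 1\\ \mathrm{AND} & \text{if } \Pi(\zeta_i) = 0 \end{cases}
\qquad\text{and}\qquad
f_z^i = \begin{cases} \mathrm{AND} & \text{if } \Pi(\zeta_i) = 1\\ \mathrm{OR} & \text{if } \Pi(\zeta_i) = 0 \end{cases}
$$

For all variables $\zeta_i$ let $f_u^i = f_v^i = f_x^i = \mathrm{AND}$ and $f_y^i = \mathrm{OR}$, and for the clause node
let $f_c = \mathrm{OR}$. The items are all performed correctly.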
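As a sanity check on the per-variable part of this configuration, here is a small sketch (an illustration only; the function names `variable_gadget` and `performs_all_items` are hypothetical). It assumes the Stage 1 wiring, with u and v each reading d as their second input, and the three items' stimulus bits (a, b, d) and responses (u, x, y, v) as listed in the construction. The clause node is deliberately left out of this check.

```python
def AND(p, q):
    return p & q

def OR(p, q):
    return p | q

# (a, b, d) -> required (u, x, y, v) for items 1, 2, 3.
ITEMS = [((0, 0, 1), (0, 0, 0, 0)),
         ((1, 1, 1), (1, 1, 1, 1)),
         ((0, 1, 0), (0, 0, 1, 0))]

def variable_gadget(pi_value, a, b, d):
    """Outputs (u, x, y, v) of one variable's gadget under the proof's configuration."""
    # f_w is OR and f_z is AND when the variable is assigned 1; swapped otherwise.
    f_w, f_z = (OR, AND) if pi_value == 1 else (AND, OR)
    w, z = f_w(a, b), f_z(a, b)
    # f_u = f_v = f_x = AND and f_y = OR, as in the configuration above.
    return (AND(w, d), AND(w, z), OR(w, z), AND(z, d))

def performs_all_items(pi_value):
    return all(variable_gadget(pi_value, *stimulus) == response
               for stimulus, response in ITEMS)

print(performs_all_items(0), performs_all_items(1))
```

Both assignments of the variable reproduce all three responses, so the choice of $\Pi(\zeta_i)$ is free as far as the outputs u, x, y and v are concerned.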




           a b d e,   u1 x1 y1 v1   u2 x2 y2 v2   u3 x3 y3 v3   c

item 1:   (0 0 1 1,   0  0  0  0    0  0  0  0    0  0  0  0    0)

item 2:   (1 1 1 0,   1  1  1  1    1  1  1  1    1  1  1  1    1)

item 3:   (0 1 0 1,   0  0  1  0    0  0  1  0    0  0  1  0    1)

Figure C.2: The composed construction for Theorem 24. This example is
for the single clause $(\zeta_1, \zeta_2, \zeta_3)$.

Conversely, if a configuration exists, let $\Pi(\zeta_i) = \langle\zeta_i\rangle = f_w^i(0,1) \oplus f_w^i(0,0)$, and
observe $\langle\zeta_1\rangle$ or $\langle\zeta_2\rangle$ or $\langle\zeta_3\rangle$ implies $\zeta_1 = 1$ or $\zeta_2 = 1$ or $\zeta_3 = 1$ as required. This proves
the claim. □

The  extension  to  multi-clause  systems  should  be  clear. 

Thus we have SAT $\propto$ Perf_SAFns and it is easy to see that the algorithm for the
transformation runs in polynomial time.

Finally, as argued for Theorem 1, Perf_SAFns $\in$ NP. Hence Perf_SAFns is NP-complete. □

Recall that items 1 and 2 produced equation C.1 which forced a relationship
between $f_w(0,0)$ and $f_w(1,1)$ given by equation C.2. From items 2 and 3 we know

$$f_u(f_w(1,1), 1) = 1 \ne 0 = f_u(f_w(0,1), 0),$$

but this does not force any particular relationship between $f_w(1,1)$ and $f_w(0,1)$
(nor do items 1 and 3 force any relationship between $f_w(0,0)$ and $f_w(0,1)$). Hence
$f_w(0,1)$ might just as well have been specified as a ‘don’t-care’ as it was in the proof
for Theorem 1. Thus input d and node u have been employed here as a switch to
simulate the do-care/don’t-care distinction for the output from node w. Similarly,
d and v have been used for z. A roughly similar technique using input e simulated
the do-care/don’t-care distinction for the output from node c.

We believe these techniques could be applied generally. They make this theorem
stronger at the expense of extra complications in the proof. We prefer to make use
of the * in our other theorems in order to simplify their proofs.

Appendix D


PROOF  FOR  PLANAR  CASE 
WITH  LSFNS 

The proof for Theorem 16 used LUFns as its node function set, and hence does
not cover the specific (and conventional) case of node function sets that are linearly
separable. This appendix gives a proof that is strong enough to cover LSFns. In
particular, we give a construction for a crossover using SAFns, which is a node function
set described in Appendix A. Because SAFns $\subset$ LSFns $\subset$ LUFns, this theorem
is sufficient to cover the linearly separable case whereas the proof for Theorem 16
was not.

Figure D.1 is the construction used in [Lic82, Fig. 4] as a crossover box in his proof
of NP-completeness for planar SAT. For that purpose the circles were interpreted
as variables and the squares as clauses. The diagram is a demonstration that the
following SAT system has a planar layout:


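clauses 1-3:
    $(\bar a_2 \lor \bar b_2 \lor \alpha)\,(a_2 \lor \bar\alpha)\,(b_2 \lor \bar\alpha)$
    i.e. $a_2 b_2 \Leftrightarrow \alpha$;

clauses 4-6:
    $(\bar a_2 \lor b_1 \lor \beta)\,(a_2 \lor \bar\beta)\,(\bar b_1 \lor \bar\beta)$
    i.e. $a_2 \bar b_1 \Leftrightarrow \beta$;

clauses 7-9:
    $(a_1 \lor b_1 \lor \gamma)\,(\bar a_1 \lor \bar\gamma)\,(\bar b_1 \lor \bar\gamma)$
    i.e. $\bar a_1 \bar b_1 \Leftrightarrow \gamma$;

clauses 10-12:
    $(a_1 \lor \bar b_2 \lor \delta)\,(\bar a_1 \lor \bar\delta)\,(b_2 \lor \bar\delta)$
    i.e. $\bar a_1 b_2 \Leftrightarrow \delta$;

clause 13:
    $(\alpha \lor \beta \lor \gamma \lor \delta)$

clauses 14-17:
    $(\bar\alpha \lor \bar\beta)\,(\bar\beta \lor \bar\gamma)\,(\bar\gamma \lor \bar\delta)\,(\bar\delta \lor \bar\alpha)$

clauses 18-19:
    $(a \lor \bar a_2)\,(\bar a \lor a_2)$
    i.e. $a \Leftrightarrow a_2$;

clauses 20-21:
    $(b \lor \bar b_2)\,(\bar b \lor b_2)$
    i.e. $b \Leftrightarrow b_2$;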
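The effect of this clause system can be confirmed by exhaustive search over its ten variables. The sketch below is an illustration under a stated assumption: the negation pattern of each clause is the one implied by the accompanying "i.e." equivalences (the overbars are easy to misread in the source), and the names `CLAUSES` and `satisfying_assignments` are hypothetical. It lists every satisfying assignment and checks that the values of a and b pass unchanged through the box.

```python
from itertools import product

NAMES = ["a", "a1", "a2", "b", "b1", "b2", "alpha", "beta", "gamma", "delta"]

# Each clause is a list of (variable, wanted_value) literals;
# ("a2", 0) stands for the negated literal "not a2".
CLAUSES = [
    # clauses 1-3: a2 and b2 <=> alpha
    [("a2", 0), ("b2", 0), ("alpha", 1)], [("a2", 1), ("alpha", 0)], [("b2", 1), ("alpha", 0)],
    # clauses 4-6: a2 and not b1 <=> beta
    [("a2", 0), ("b1", 1), ("beta", 1)], [("a2", 1), ("beta", 0)], [("b1", 0), ("beta", 0)],
    # clauses 7-9: not a1 and not b1 <=> gamma
    [("a1", 1), ("b1", 1), ("gamma", 1)], [("a1", 0), ("gamma", 0)], [("b1", 0), ("gamma", 0)],
    # clauses 10-12: not a1 and b2 <=> delta
    [("a1", 1), ("b2", 0), ("delta", 1)], [("a1", 0), ("delta", 0)], [("b2", 1), ("delta", 0)],
    # clause 13: at least one of alpha, beta, gamma, delta
    [("alpha", 1), ("beta", 1), ("gamma", 1), ("delta", 1)],
    # clauses 14-17: no two cyclically adjacent ones together
    [("alpha", 0), ("beta", 0)], [("beta", 0), ("gamma", 0)],
    [("gamma", 0), ("delta", 0)], [("delta", 0), ("alpha", 0)],
    # clauses 18-21: a <=> a2 and b <=> b2
    [("a", 1), ("a2", 0)], [("a", 0), ("a2", 1)],
    [("b", 1), ("b2", 0)], [("b", 0), ("b2", 1)],
]

def satisfying_assignments():
    """Return every assignment of the ten variables that satisfies all 21 clauses."""
    found = []
    for bits in product((0, 1), repeat=len(NAMES)):
        env = dict(zip(NAMES, bits))
        if all(any(env[var] == want for var, want in clause) for clause in CLAUSES):
            found.append(env)
    return found

models = satisfying_assignments()
for env in models:
    print(env["a"], env["a1"], env["a2"], env["b"], env["b1"], env["b2"])
```

Under this reading there is one satisfying assignment for each combination of a and b, and in each of them a, a1, a2 agree and b, b1, b2 agree, which is the behaviour expected of a crossover.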

We will use this construction in the proof of the following theorem:

Figure D.1: Construction from Lichtenstein for Planar SAT. The prototypical
crossover of two lines shown in the upper right is replaced by the much
larger construction, which provides the same constraints as the smaller one
would have.

Theorem 25  For any node function set including SAFns, loading is NP-complete
even for 2-layered architectures with planar SCI graphs.

Proof:  We  give  only  a  construction  for  a  ‘crossover  box’  which  can  be  used  to 
eliminate  one  crossing  of  connections  as  they  might  occur  in  the  proof  of  Theorem 
15. 

For our purpose Figure D.1 is re-interpreted as the plan view of an architecture:
the circles being first-layer nodes and the squares being second-layer nodes. To
accompany this architecture, a task is constructed to mimic the effect of each clause.
For this we use the techniques of constructing items that are used in the proof of
Theorem 21. First, two items ensure that $f_i(0,0) = 0$ and $f_i(1,1) = 1$ for all
variables $i \in \{a, a_1, a_2, b, b_1, b_2, \alpha, \beta, \gamma, \delta\}$:

 a  a1 a2 b  b1 b2 α  β  γ  δ       a a1 a2 b b1 b2 α β γ δ   c1 c2 c3 ... c21

 00 00 00 00 00 00 00 00 00 00  ->  0 0  0  0 0  0  0 0 0 0   *  *  *  ... *

 11 11 11 11 11 11 11 11 11 11  ->  1 1  1  1 1  1  1 1 1 1   *  *  *  ... *

Each  of  Lichtenstein’s  clauses  produces  2  items  as  in  the  following  example: 


 a  a1 a2 b  b1 b2 α  β  γ  δ       a a1 a2 b b1 b2 α β γ δ   c1 c2 c3 ... c21

 ** ** 11 ** ** 11 00 ** ** **  ->  * *  *  * *  *  * * * *   0  *  *  ... *

 ** ** 01 ** ** 01 01 ** ** **  ->  * *  *  * *  *  * * * *   1  *  *  ... *

This corresponds to clause 1, which was $(\bar a_2 \lor \bar b_2 \lor \alpha)$. These two items ensure that

$$f_{c_1}(f_{a_2}(1,1), f_{b_2}(1,1), f_\alpha(0,0)) = 0 \ne 1 = f_{c_1}(f_{a_2}(0,1), f_{b_2}(0,1), f_\alpha(0,1))$$

$$f_{c_1}(1, 1, 0) = 0 \ne 1 = f_{c_1}(f_{a_2}(0,1), f_{b_2}(0,1), f_\alpha(0,1))$$

$$f_{a_2}(0,1) \ne 1 \ \text{ or } \ f_{b_2}(0,1) \ne 1 \ \text{ or } \ f_\alpha(0,1) \ne 0$$

The direct correspondence to $(\bar a_2 \lor \bar b_2 \lor \alpha)$ should be clear. □
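The pattern behind the two items for clause 1 can be written down generically. The sketch below is a hedged generalization from that single example, not a statement of the thesis's construction for the other clauses: the helper `clause_items` and its data layout are hypothetical. It assumes that the first item drives every literal of the clause to false (stimulus 11 for a negated variable, 00 for a plain one, relying on $f_i(0,0) = 0$ and $f_i(1,1) = 1$) and demands output 0 from the clause node, while the second item presents 01 to the same variables and demands output 1. All other stimulus and response positions are don't-cares.

```python
VARIABLES = ["a", "a1", "a2", "b", "b1", "b2", "alpha", "beta", "gamma", "delta"]

def clause_items(clause, clause_node):
    """Build the two items for one clause.

    clause: list of (variable, negated) pairs, e.g. ("a2", True) for "not a2".
    Returns two (stimulus, response) pairs; the stimulus maps each variable to a
    two-character string and the response lists only the one defined output.
    """
    # Stimulus that makes each literal false: a negated literal is false when
    # its variable outputs 1 (stimulus 11), a plain literal when it outputs 0.
    falsify = {var: ("11" if negated else "00") for var, negated in clause}
    item_zero = ({v: falsify.get(v, "**") for v in VARIABLES}, {clause_node: "0"})
    # Stimulus 01 exposes the free value f_i(0,1) of each participating variable.
    item_one = ({v: ("01" if v in falsify else "**") for v in VARIABLES},
                {clause_node: "1"})
    return item_zero, item_one

clause_1 = [("a2", True), ("b2", True), ("alpha", False)]
for stimulus, response in clause_items(clause_1, "c1"):
    print(" ".join(stimulus[v] for v in VARIABLES), "->", response)
```

For clause 1 this reproduces the two stimulus rows shown in the example, with c1 required to be 0 on the first and 1 on the second.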